
IMDB Classifier #319

Closed
wants to merge 5 commits into from

Conversation

@jrxk (Collaborator) commented Dec 1, 2020

This PR fixes #293.

Description of changes

Added a text classifier for the IMDB large movie dataset based on Texar-PyTorch and BERT. The model expects CSV file inputs with columns (content, label, id).
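For illustration, an input file in that format might look like the following (rows are made up; the exact label encoding is not shown in this PR):

    content,label,id
    "A gripping film with a superb cast.",pos,1
    "Dull plot and wooden acting throughout.",neg,2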

Test Conducted

Added an example.

@jrxk self-assigned this Dec 1, 2020
@jrxk requested a review from hunterhector on December 1, 2020 at 18:40
@jrxk added the labels data_aug (Features on data augmentation) and topic: data (Issue about data loader modules and data processing related) on Dec 1, 2020
@hunterhector (Member) left a comment

The main problem is that this classifier doesn't use any Forte data processors, so how can we apply the data augmentation processors here? Just from reading this example, it doesn't really belong in the Forte repo.

@jrxk (Collaborator, Author) commented Dec 14, 2020

> The main problem is that this classifier doesn't use any Forte data processors, so how can we apply the data augmentation processors here? Just from reading this example, it doesn't really belong in the Forte repo.

I added a pipeline that uses the Forte reader to preprocess data.
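A rough sketch of what such a preprocessing pipeline could look like (the reader class, its import path, and the label-extraction step are placeholders, not the exact code in this PR; the Pipeline calls are Forte's set_reader / initialize / process_dataset API):

    import csv

    from forte.pipeline import Pipeline
    # Hypothetical import path for the IMDB reader added in this PR.
    from imdb_reader import ImdbReader

    pipeline = Pipeline()
    pipeline.set_reader(ImdbReader())
    pipeline.initialize()

    with open("data/IMDB/train.csv", "w", encoding="utf-8", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["content", "label", "id"])
        for doc_id, pack in enumerate(
                pipeline.process_dataset("data/IMDB_raw/aclImdb/train")):
            label = "unknown"  # placeholder: the real label comes from the reader
            writer.writerow([pack.text, label, doc_id])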

@jrxk closed this Dec 14, 2020
@jrxk reopened this Dec 14, 2020
@jrxk (Collaborator, Author) commented Dec 14, 2020

The UDA Experiment PR #320 is based on this PR. Should I create a separate directory for the UDA experiment example, or should I just move the UDA changes to this PR?

@hunterhector (Member) commented

> The UDA Experiment PR #320 is based on this PR. Should I create a separate directory for the UDA experiment example, or should I just move the UDA changes to this PR?

We discussed in our last meeting that we need all of the data augmentation examples in one folder. Can you coordinate and create that?

@hunterhector (Member) left a comment

I am very puzzled by this PR:

  1. Why are we using the test set as the dev set?
  2. Where are the back translation models?

There are a lot more comments inline.

@@ -1,3 +1,3 @@
python download_imdb.py
python utils/imdb_format.py --raw_data_dir=data/IMDB_raw/aclImdb --train_id_path=data/IMDB_raw/train_id_list.txt --output_dir=data/IMDB
python preprocess_pipeline.py
python main.py
Member:

Add a newline at the end of the file.

@@ -6,7 +6,7 @@
 # used for bert executor example
 max_batch_tokens = 128
 
-train_batch_size = 32
+train_batch_size = 24
Member:

Why do we have two copies of config data?

Collaborator Author:

I put it in the model directory as an example of the expected parameters in config_data. The user can simply copy this file if they want to use the model.

Member:

We should probably keep only one to reduce maintenance effort.

@@ -74,13 +77,13 @@ def get_labels(self):
         raise NotImplementedError()
 
     @classmethod
-    def _read_tsv(cls, input_file, quotechar=None):
+    def _read_tsv(cls, input_file, quotechar=None):  # pylint: disable=unused-argument
Member:

Do we still need all these _read_tsv functions after using our own reader?

Collaborator Author:

Our reader reads the raw data files (100k separate TXT files) and outputs train.csv and test.csv. Then our model reads these CSV files and generates pickle files for training, which is why we still need _read_tsv here.

Member:

That's why I am confused. Why do we still need to go through the whole CSV step? We can read directly from our reader.

"""Run back translation."""
use_min_length = 10
use_max_length_diff_ratio = 0.5
logging.info("running bt augmentation")
Member:

When logging, let's be specific. Don't use the abbreviation.

import config_data
import config_classifier

from forte.models.imdb_text_classifier.model import IMDBClassifier
Member:

In which PR can I find this classifier?

Member:

Is the classifier for IMDB only? If it is a general LSTM or CNN classifier, we should consider renaming it.

Collaborator Author:

It is in this PR, under forte/models/imdb_text_classifier. It is a BERT text classifier. The BERT model itself is not specific to IMDB, but this PR contains preprocessing code specific to the IMDB dataset to make it work.

Member:

Can you move the preprocessing out of the core model?

import config_data
import config_classifier

from forte.models.imdb_text_classifier.model import IMDBClassifier


def main():
model = IMDBClassifier(config_data, config_classifier)
Member:

Why is the model responsible for prepare_data?

Collaborator Author:

The model expects a pickled data format that is specific to the model.

Member:

It is not a good idea to put all of this into the model. It fixes the model so that it can only do one thing.
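A sketch of the separation being suggested (function and argument names here are illustrative, not this PR's actual API): keep the pickle preparation as a free-standing step so the model class only trains and predicts.

    import pickle

    def prepare_pickle_data(csv_path, output_path, tokenizer):
        """Turn (content, label, id) CSV rows into the pickled feature
        records the trainer loads; lives outside the model class."""
        records = []
        # ... tokenize each row with `tokenizer` and collect feature dicts ...
        with open(output_path, "wb") as f:
            pickle.dump(records, f)

    # The model then stays agnostic to how the data was prepared:
    # model = IMDBClassifier(config_data, config_classifier)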

-            for line in reader:
-                lines.append(line)
+            for line in f.readlines():
+                lines.append(line.split('\t'))
             return lines


Member:

Something like clean_web_text should be done in the reader. If this has already been done, you can remove this function.
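For reference, a minimal version of such cleaning might look like this (the raw aclImdb reviews embed HTML line breaks; the actual clean_web_text in the PR may do more):

    import re

    def clean_web_text(text: str) -> str:
        text = text.replace("<br />", " ")        # IMDB encodes newlines as <br />
        text = re.sub(r"<[^>]+>", " ", text)      # drop any remaining HTML tags
        return re.sub(r"\s+", " ", text).strip()  # collapse whitespace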

-            for line in reader:
-                lines.append(line)
+            for line in f.readlines():
+                lines.append(line.split('\t'))
Member:

Please use a CSV reader instead of line.split; this is not the correct way to read a CSV file.
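Roughly what is being asked for, reusing input_file and quotechar from the _read_tsv signature shown earlier (a sketch, not the final code):

    import csv

    with open(input_file, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
        lines = list(reader)  # handles quoting and embedded tabs correctly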


text_per_example = 1

with open(back_translation_file, encoding='utf-8') as inf:
Member:

How are these back translations done?

Collaborator Author:

This is part of the code for the UDA experiment. I haven't included the UDA code in this PR (yet).

In the UDA experiment, the back translations should be generated by the user and written to a file. They can use their own back translation model or Forte's back translation model.
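For example, a user could produce that file with an off-the-shelf round trip such as MarianMT (purely illustrative; not necessarily the models used in the UDA experiment):

    from transformers import MarianMTModel, MarianTokenizer

    # English -> French -> English round trip.
    fwd_tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
    fwd = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
    bwd_tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en")
    bwd = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-fr-en")

    def back_translate(text: str) -> str:
        fr_ids = fwd.generate(**fwd_tok(text, return_tensors="pt", truncation=True))
        fr_text = fwd_tok.decode(fr_ids[0], skip_special_tokens=True)
        en_ids = bwd.generate(**bwd_tok(fr_text, return_tensors="pt", truncation=True))
        return bwd_tok.decode(en_ids[0], skip_special_tokens=True)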

Member:

I see. Do you plan to merge that first? It could be a different PR.

@@ -122,7 +125,7 @@ def get_train_examples(self, raw_data_dir):
                 quotechar='"'), "train")
 
     def get_dev_examples(self, raw_data_dir):
-        """See base class."""
+        """The IMDB dataset does not have a dev set so we just use test set"""
Member:

I don't understand. How can you use the test set for dev purposes?

Collaborator Author:

The IMDB dataset does not have a dev set. It is simply split into 25000 training examples and 25000 test examples.

The user should be aware of this. I can also remove this function if that makes it clearer.

Member:

But when running this experiment, what are we doing with the dev set?

@hunterhector (Member) commented Dec 14, 2020

Do you mean this function is not used? We should probably remove all unused functions.

Collaborator Author:

I can remove this. When training, we simply look at the test set results.

@hunterhector (Member) commented

One thing we can do to fix this PR is to have @ziqian98 help resolve many of the Forte parts.

@codecov (bot) commented Dec 14, 2020

Codecov Report

Merging #319 (12342a6) into master (f243663) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #319   +/-   ##
=======================================
  Coverage   80.03%   80.03%           
=======================================
  Files         163      163           
  Lines       10196    10196           
=======================================
  Hits         8160     8160           
  Misses       2036     2036           


@jrxk (Collaborator, Author) commented Dec 19, 2020

Closing this PR because we will use Ziqian's classifier as an IMDB classifier example.

@jrxk closed this Dec 19, 2020
Successfully merging this pull request may close these issues:

  Create a text classifier for the IMDB large movie dataset (#293)