
IMDB Classifier #319

Closed
wants to merge 5 commits into from

Conversation

@jrxk (Collaborator) commented Dec 1, 2020

This PR fixes #293.

Description of changes

Added a text classifier for the IMDB large movie dataset based on Texar-PyTorch and BERT. The model expects CSV file inputs with columns (content, label, id).
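For illustration, an input file in that format might look like the following (rows are made up; the exact label encoding is not shown in this PR):

    content,label,id
    "A gripping film with a superb cast.",pos,1
    "Dull plot and wooden acting throughout.",neg,2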

Test Conducted

Added an example.

@jrxk self-assigned this Dec 1, 2020
@jrxk requested a review from hunterhector on December 1, 2020 at 18:40
@jrxk added the labels data_aug (Features on data augmentation) and topic: data (Issue about data loader modules and data processing related) on Dec 1, 2020
@hunterhector (Member) left a comment

The main problem is that this classifier doesn't use any Forte data processors, so how can we apply the data augmentation processors here? Just from reading this example, it doesn't really belong in the Forte repo.

@jrxk (Collaborator, Author) commented Dec 14, 2020

> The main problem is that this classifier doesn't use any Forte data processors, so how can we apply the data augmentation processors here? Just from reading this example, it doesn't really belong in the Forte repo.

I added a pipeline that uses the Forte reader to preprocess data.
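A rough sketch of what such a preprocessing pipeline could look like (the reader class, its import path, and the label-extraction step are placeholders, not the exact code in this PR; the Pipeline calls are Forte's set_reader / initialize / process_dataset API):

    import csv

    from forte.pipeline import Pipeline
    # Hypothetical import path for the IMDB reader added in this PR.
    from imdb_reader import ImdbReader

    pipeline = Pipeline()
    pipeline.set_reader(ImdbReader())
    pipeline.initialize()

    with open("data/IMDB/train.csv", "w", encoding="utf-8", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["content", "label", "id"])
        for doc_id, pack in enumerate(
                pipeline.process_dataset("data/IMDB_raw/aclImdb/train")):
            label = "unknown"  # placeholder: the real label comes from the reader
            writer.writerow([pack.text, label, doc_id])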

@jrxk closed this Dec 14, 2020
@jrxk reopened this Dec 14, 2020
@jrxk (Collaborator, Author) commented Dec 14, 2020

The UDA Experiment PR #320 is based on this PR. Should I create a separate directory for the UDA experiment example, or should I just move the UDA changes to this PR?

@hunterhector (Member) commented

> The UDA Experiment PR #320 is based on this PR. Should I create a separate directory for the UDA experiment example, or should I just move the UDA changes to this PR?

We discussed in our last meeting that we need all of the data augmentation examples in one folder. Can you coordinate and create that?

@hunterhector (Member) left a comment

I am very puzzled by this PR:

  1. Why are we using the test set as the dev set?
  2. Where are the back translation models?

There are a lot more comments inline.

@@ -1,3 +1,3 @@
python download_imdb.py
python utils/imdb_format.py --raw_data_dir=data/IMDB_raw/aclImdb --train_id_path=data/IMDB_raw/train_id_list.txt --output_dir=data/IMDB
python preprocess_pipeline.py
python main.py
Member:

Add a newline at the end of the file.

@@ -6,7 +6,7 @@
 # used for bert executor example
 max_batch_tokens = 128
 
-train_batch_size = 32
+train_batch_size = 24
Member:

Why do we have two copies of config data?

Collaborator Author:

I put it in the model directory as an example of the expected parameters in config_data. The user can simply copy this file if they want to use the model.

Member:

We should probably keep only one to reduce maintenance effort.

@@ -74,13 +77,13 @@ def get_labels(self):
         raise NotImplementedError()
 
     @classmethod
-    def _read_tsv(cls, input_file, quotechar=None):
+    def _read_tsv(cls, input_file, quotechar=None):  # pylint: disable=unused-argument
Member:

Do we still need all these _read_tsv functions after using our own reader?

Collaborator Author:

Our reader reads the raw data files (100k separate TXT files) and outputs train.csv and test.csv. Then our model reads these CSV files and generates pickle files for training, which is why we still need _read_tsv here.

Member:

That's why I am confused. Why do we still need to go through the whole CSV step? We can read directly from our reader.

"""Run back translation."""
use_min_length = 10
use_max_length_diff_ratio = 0.5
logging.info("running bt augmentation")
Member:

When logging, let's be specific. Don't use the abbreviation.

import config_data
import config_classifier

from forte.models.imdb_text_classifier.model import IMDBClassifier
Member:

In which PR can I find this classifier?

Member:

Is the classifier for IMDB only? If it is a general LSTM or CNN classifier, we should consider renaming it.

Collaborator Author:

It is in this PR, under forte/models/imdb_text_classifier. It is a BERT text classifier. The BERT model itself is not specific to IMDB, but this PR contains preprocessing code specific to the IMDB dataset to make it work.

Member:

Can you move the preprocessing out of the core model?

import config_data
import config_classifier

from forte.models.imdb_text_classifier.model import IMDBClassifier


def main():
model = IMDBClassifier(config_data, config_classifier)
Member:

Why is the model responsible for prepare_data?

Collaborator Author:

The model expects a pickled data format that is specific to the model.

Member:

It is not a good idea to put all of this into the model. It fixes the model so that it can only do one thing.
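A sketch of the separation being suggested (function and argument names here are illustrative, not this PR's actual API): keep the pickle preparation as a free-standing step so the model class only trains and predicts.

    import pickle

    def prepare_pickle_data(csv_path, output_path, tokenizer):
        """Turn (content, label, id) CSV rows into the pickled feature
        records the trainer loads; lives outside the model class."""
        records = []
        # ... tokenize each row with `tokenizer` and collect feature dicts ...
        with open(output_path, "wb") as f:
            pickle.dump(records, f)

    # The model then stays agnostic to how the data was prepared:
    # model = IMDBClassifier(config_data, config_classifier)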

-            for line in reader:
-                lines.append(line)
+            for line in f.readlines():
+                lines.append(line.split('\t'))
             return lines


Member:

Something like clean_web_text should be done in the reader. If this has already been done, you can remove this function.
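For reference, a minimal version of such cleaning might look like this (the raw aclImdb reviews embed HTML line breaks; the actual clean_web_text in the PR may do more):

    import re

    def clean_web_text(text: str) -> str:
        text = text.replace("<br />", " ")        # IMDB encodes newlines as <br />
        text = re.sub(r"<[^>]+>", " ", text)      # drop any remaining HTML tags
        return re.sub(r"\s+", " ", text).strip()  # collapse whitespace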

-            for line in reader:
-                lines.append(line)
+            for line in f.readlines():
+                lines.append(line.split('\t'))
Member:

Please use a CSV reader instead of line.split; this is not the correct way to read a CSV file.
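Roughly what is being asked for, reusing input_file and quotechar from the _read_tsv signature shown earlier (a sketch, not the final code):

    import csv

    with open(input_file, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
        lines = list(reader)  # handles quoting and embedded tabs correctly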


text_per_example = 1

with open(back_translation_file, encoding='utf-8') as inf:
Member:

How are these back translations done?

Collaborator Author:

This is part of the code for the UDA experiment. I haven't included the UDA code in this PR (yet).

In the UDA experiment, the back translations should be generated by the user and written to a file. They can use their own back translation model or Forte's back translation model.
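For example, a user could produce that file with an off-the-shelf round trip such as MarianMT (purely illustrative; not necessarily the models used in the UDA experiment):

    from transformers import MarianMTModel, MarianTokenizer

    # English -> French -> English round trip.
    fwd_tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
    fwd = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
    bwd_tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en")
    bwd = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-fr-en")

    def back_translate(text: str) -> str:
        fr_ids = fwd.generate(**fwd_tok(text, return_tensors="pt", truncation=True))
        fr_text = fwd_tok.decode(fr_ids[0], skip_special_tokens=True)
        en_ids = bwd.generate(**bwd_tok(fr_text, return_tensors="pt", truncation=True))
        return bwd_tok.decode(en_ids[0], skip_special_tokens=True)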

Member:

I see. Do you plan to merge that first? It could be a different PR.

@@ -122,7 +125,7 @@ def get_train_examples(self, raw_data_dir):
                 quotechar='"'), "train")
 
     def get_dev_examples(self, raw_data_dir):
-        """See base class."""
+        """The IMDB dataset does not have a dev set so we just use test set"""
Member:

I don't understand. How can you use the test set for dev purposes?

Collaborator Author:

The IMDB dataset does not have a dev set. It is simply split into 25000 training examples and 25000 test examples.

The user should be aware of this. I can also remove this function if that makes it clearer.

Member:

But when running this experiment, what are we doing with the dev set?

@hunterhector (Member) commented Dec 14, 2020

Do you mean this function is not used? We should probably remove all unused functions.

Collaborator Author:

I can remove this. When training, we simply look at the test set results.

@hunterhector (Member) commented

One thing we can do to fix this PR is to have @ziqian98 help resolve many of the Forte parts.

@codecov (bot) commented Dec 14, 2020

Codecov Report

Merging #319 (12342a6) into master (f243663) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #319   +/-   ##
=======================================
  Coverage   80.03%   80.03%           
=======================================
  Files         163      163           
  Lines       10196    10196           
=======================================
  Hits         8160     8160           
  Misses       2036     2036           


@jrxk (Collaborator, Author) commented Dec 19, 2020

Closing this PR because we will use Ziqian's classifier as an IMDB classifier example.

@jrxk closed this Dec 19, 2020
Successfully merging this pull request may close these issues:

  Create a text classifier for the IMDB large movie dataset (#293)