IMDB Classifier #319
Conversation
The main problem is that this classifier doesn't use any Forte data processors, so how can we apply the data augmentation processors here? Just reading this example, it does not really belong in the Forte repo.
I added a pipeline that uses the Forte reader to preprocess data.
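For context, a minimal sketch of what such a preprocessing pipeline can look like, assuming Forte's Pipeline API and an IMDB reader along the lines of LargeMovieReader (the exact reader class used in this PR may differ):

```python
# Hedged sketch: read the raw aclImdb files through a Forte reader so that
# downstream augmentation processors can plug into the same pipeline.
from forte.pipeline import Pipeline
from forte.data.readers import LargeMovieReader  # assumption: reader name

pipeline = Pipeline()
pipeline.set_reader(LargeMovieReader())
pipeline.initialize()

# Each DataPack holds one review; a later step can write them out to CSV.
for pack in pipeline.process_dataset("data/IMDB_raw/aclImdb/train"):
    print(pack.text[:80])
```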
The UDA Experiment PR #320 is based on this PR. Should I create a separate directory for the UDA experiment example, or should I just move the UDA changes to this PR?
We discussed in our last meeting that we need to have all the data augmentation examples in one folder. Can you coordinate and create that?
I am very puzzled by this PR:
- Why are we using the test set as the dev set?
- Where are the back translation models?
There are a lot more comments inline.
@@ -1,3 +1,3 @@
python download_imdb.py
python utils/imdb_format.py --raw_data_dir=data/IMDB_raw/aclImdb --train_id_path=data/IMDB_raw/train_id_list.txt --output_dir=data/IMDB
python preprocess_pipeline.py
python main.py
Add a newline at the end of the file.
@@ -6,7 +6,7 @@
 # used for bert executor example
 max_batch_tokens = 128

-train_batch_size = 32
+train_batch_size = 24
Why do we have two copies of config data?
I put it in the model directory as an example of the expected parameters in config_data. The user can simply copy this file if they want to use the model.
We should probably only keep one to reduce maintenance effort.
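For illustration, a sketch of the single-copy arrangement, grounded in the constructor call this PR's main.py already uses:

```python
# Keep exactly one config_data.py at the example root and pass it in,
# rather than shipping a second copy inside the model directory.
import config_data
import config_classifier

from forte.models.imdb_text_classifier.model import IMDBClassifier

model = IMDBClassifier(config_data, config_classifier)
```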
@@ -74,13 +77,13 @@ def get_labels(self):
         raise NotImplementedError()

     @classmethod
-    def _read_tsv(cls, input_file, quotechar=None):
+    def _read_tsv(cls, input_file, quotechar=None):  # pylint: disable=unused-argument
Do we still need all these read_tsv functions after using our own reader?
Our reader reads data from the raw data files (100k separate TXT files) and outputs train.csv and test.csv. Then, our model reads these CSV files and generates pickle files for training, which is why we need read_tsv here.
That's why I am confused. Why do we still need to go through the whole CSV step? We can read directly from our reader.
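A hedged sketch of what reading directly from the reader could look like, assuming a Forte IMDB reader (LargeMovieReader here is an assumption) and the standard aclImdb layout with one .txt file per review under pos/ and neg/:

```python
# Build training examples straight from DataPacks, skipping the
# train.csv/test.csv round trip entirely.
from forte.pipeline import Pipeline
from forte.data.readers import LargeMovieReader  # assumption: reader name

pipeline = Pipeline()
pipeline.set_reader(LargeMovieReader())
pipeline.initialize()

examples = []
for label in ("pos", "neg"):  # aclImdb stores reviews in per-class directories
    for pack in pipeline.process_dataset(f"data/IMDB_raw/aclImdb/train/{label}"):
        examples.append((pack.text, label))
```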
"""Run back translation.""" | ||
use_min_length = 10 | ||
use_max_length_diff_ratio = 0.5 | ||
logging.info("running bt augmentation") |
When logging, let's be specific. Don't use the abbreviation.
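For example, spelling the operation out in the message:

```python
import logging

# Specific, unabbreviated message instead of "running bt augmentation":
logging.info("Running back translation augmentation")
```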
import config_data
import config_classifier

from forte.models.imdb_text_classifier.model import IMDBClassifier
In which PR can I find this classifier?
Is the classifier for IMDB only? If it is a general LSTM or CNN classifier, we should consider renaming it.
It is in this PR, under forte/models/imdb_text_classifier. It is a BERT text classifier. The BERT model itself is not specific to IMDB, but this PR contains preprocessing code specific to the IMDB dataset to make it work.
Can you move the preprocessing out of the core model?
import config_data
import config_classifier

from forte.models.imdb_text_classifier.model import IMDBClassifier


def main():
    model = IMDBClassifier(config_data, config_classifier)
Why is the model responsible for prepare_data?
The model expects a pickle data format which is specific to the model.
It is not a good idea to put all of this into the model. It fixes the model so that it can only do one thing.
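A sketch of the separation being suggested; prepare_pickle_data is a hypothetical standalone function, not something in this PR:

```python
# Keep pickle generation as a standalone preprocessing step so the model
# class stays reusable for other datasets and tasks.
import config_data
import config_classifier

from forte.models.imdb_text_classifier.model import IMDBClassifier


def prepare_pickle_data(csv_dir, output_dir):
    """Hypothetical standalone step: convert train.csv/test.csv into the
    pickle format the trainer consumes."""
    ...


prepare_pickle_data("data/IMDB", "data/IMDB_pickle")
model = IMDBClassifier(config_data, config_classifier)  # only trains/evaluates
```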
-        for line in reader:
-            lines.append(line)
+        for line in f.readlines():
+            lines.append(line.split('\t'))
         return lines
Something like clean_web_text should be done in the reader. If this has already been done, you can remove this function.
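For reference, a minimal sketch of the kind of cleanup clean_web_text typically performs on IMDB reviews (the exact rules in this PR may differ):

```python
def clean_web_text(text: str) -> str:
    """Strip common HTML residue from a raw IMDB review."""
    text = text.replace("<br />", " ")   # IMDB reviews embed HTML line breaks
    text = text.replace("&quot;", '"')   # unescape common HTML entities
    text = text.replace("&amp;", "&")
    return text.strip()
```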
-        for line in reader:
-            lines.append(line)
+        for line in f.readlines():
+            lines.append(line.split('\t'))
Please use a csv reader instead of doing line.split; this is not the correct way to read CSV.
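A sketch of the suggested fix, reusing _read_tsv's input_file and quotechar parameters; the csv module handles quoted fields that themselves contain the delimiter, which a bare line.split('\t') silently breaks on:

```python
import csv

with open(input_file, encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
    lines = list(reader)
```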
    text_per_example = 1

    with open(back_translation_file, encoding='utf-8') as inf:
How are these back translations done?
This is part of the code for the UDA experiment. I haven't included the UDA code in this PR (yet).
In the UDA experiment, the back translation should be generated by the user and output to a file. They can use their own back translation model or Forte's back translation model.
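A hedged sketch of consuming such a user-generated file, assuming one paraphrase per line aligned with the original examples (consistent with text_per_example = 1 above); the real format is whatever the user's back translation step produces, and original_texts is a hypothetical list of the source reviews:

```python
with open(back_translation_file, encoding="utf-8") as inf:
    paraphrases = [line.strip() for line in inf]

# Hypothetical pairing; requires the file to be line-aligned with the originals.
augmented = list(zip(original_texts, paraphrases))
```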
I see. Do you plan to merge that first? It can be a different PR.
@@ -122,7 +125,7 @@ def get_train_examples(self, raw_data_dir):
                          quotechar='"'), "train")

     def get_dev_examples(self, raw_data_dir):
-        """See base class."""
+        """The IMDB dataset does not have a dev set so we just use test set"""
I don't understand; how can you use the test set for dev purposes?
The IMDB dataset does not have a dev set; it is simply split into 25,000 training examples and 25,000 test examples. The user should be aware of this. I can also remove this function if that makes it clearer.
But when you are doing this experiment, what are we doing with the dev set?
Do you mean this function is not used? We should probably remove all unused functions.
I can remove this. When training, we simply look at the test set results.
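One conventional alternative (an assumption, not what this PR does) is to carve a dev set out of the 25,000 training reviews instead of evaluating on test during development; all_train_examples below is a hypothetical list of the labeled training examples:

```python
from sklearn.model_selection import train_test_split

train_examples, dev_examples = train_test_split(
    all_train_examples, test_size=0.1, random_state=42
)
```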
One thing we can do to fix this PR: @ziqian98 can help resolve many of the Forte parts.
Codecov Report
@@           Coverage Diff           @@
##           master     #319   +/-   ##
=======================================
  Coverage   80.03%   80.03%
=======================================
  Files         163      163
  Lines       10196    10196
=======================================
  Hits         8160     8160
  Misses       2036     2036

Continue to review the full report at Codecov.
Closing this PR because we will use Ziqian's classifier as an IMDB classifier example.
This PR fixes #293.
Description of changes
Added a text classifier for the IMDB large movie dataset based on Texar-PyTorch and BERT. The model expects CSV file inputs with columns (content, label, id).
Tests Conducted
Added an example.