This repository contains easy-to-understand PyTorch code for text classification using Yoon Kim's CNN model.
Input data must be in three files:
- topicclass_train.txt
- topicclass_valid.txt
- topicclass_test.txt
Each file must contain one input example per line, in the following format:
<label> ||| <sentence>
For instance:
Social sciences and society ||| Several of these rights regulate pre @-@ trial procedure : access to a non @-@ excessive bail , the right to indictment by a grand jury , the right to an information ( charging document ) , the right to a speedy trial , and the right to be tried in a specific venue .
We assume that the data is already tokenized, so we use Python's split function to break each sentence into tokens on whitespace. This repository was tested on this dataset.
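To make the expected format concrete, here is a minimal sketch of how one line could be parsed into a label and a token list. The helper names parse_line and read_examples are hypothetical and are not taken from this repo; the actual loading code may differ.

```python
from typing import List, Tuple


def parse_line(line: str) -> Tuple[str, List[str]]:
    """Split one '<label> ||| <sentence>' line into a label and its tokens.

    Hypothetical helper for illustration; not part of the repo.
    """
    # Split only on the first separator so a '|||' inside the sentence survives.
    label, sentence = line.rstrip("\n").split("|||", 1)
    # The data is assumed to be pre-tokenized, so whitespace splitting is enough.
    return label.strip(), sentence.split()


def read_examples(path: str) -> List[Tuple[str, List[str]]]:
    """Read every non-empty line of a data file (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [parse_line(line) for line in f if line.strip()]


label, tokens = parse_line(
    "Social sciences and society ||| the right to a speedy trial ."
)
```

Calling read_examples on topicclass_train.txt would then give a list of (label, tokens) pairs, assuming every non-empty line contains the ||| separator.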
Some basic EDA is provided in this notebook.
After the data is in the correct format, fill in the entries in the config file. A template is provided in the repo. Then train the model with a command such as:
python run.py -model kim_cnn -lr 0.001 -drop_prob 0.5 -batch_size 4096 -cuda -use_trainable_embed -use_fixed_embed -gpu 0 -epochs 10
Expected accuracy is 85.5% on this dataset.
Before running, please install the dependencies listed in the requirements file provided in the repo.