- This repository is a Tensorflow implementation of R-NET, a neural network designed to solve the Question Answering (QA) task.
- This implementation is specifically designed for SQuAD, a large-scale QA dataset that has recently drawn considerable attention.
- If you have any questions, contact [email protected].
- Python 3.6
- Tensorflow-gpu 1.2.1
- Numpy 1.13.1
- NLTK
- First we need to download SQuAD as well as the pre-trained GloVe word embeddings. This should take roughly 30 minutes, depending on network speed.
```shell
cd Data
sh download.sh
cd ..
```
- Data preprocessing, including tokenization and collection of pre-trained word embeddings, can take about 15 minutes. Two kinds of files, `{data/shared}_{train/dev}.json`, will be generated and stored in `Data`.
  - shared: the original and tokenized articles, GloVe word embeddings, and character dictionaries.
  - data: the example IDs, corresponding article IDs, tokenized questions, and answer indices.

```shell
python preprocess.py --gen_seq True
```
- Train R-NET by simply executing the following. The program will:
  - Read the training data and build the model. This should take around an hour, depending on hardware.
  - Train for 12 epochs, by default.

  Hyper-parameters can be specified in `Models/config.json`. The training procedure, including the mean loss and mean EM score for each epoch, will be stored in `Results/rnet_training_result.txt`. Note that the scores that appear during training could be lower than the scores from the official evaluator. The models will be stored in `Models/save/`.

```shell
python rnet.py
```
- The evaluation of the model on the dev set can be generated by executing the following. The result will be stored in `Results/rnet_prediction.txt`. Note that the scores that appear during evaluation could be lower than the scores from the official evaluator.

```shell
python evaluate.py
```
- To get the final official score, you need to use the official evaluation script, located in the `Results` directory.

```shell
python Results/evaluate-v1.1.py Data/dev-v1.1.json Results/rnet_prediction.txt
```
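For intuition, the evaluator's two metrics can be approximated as follows. This is a simplified re-implementation for illustration only, not the script shipped in `Results` (the exact normalization order may differ slightly from the official code):

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, drop articles and punctuation, and collapse whitespace,
    mirroring the normalization in the official SQuAD evaluator."""
    s = s.lower()
    s = re.sub(r"\b(a|an|the)\b", " ", s)          # remove articles
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    return " ".join(s.split())                      # normalize whitespace

def exact_match(prediction, ground_truth):
    """EM: the normalized strings must be identical."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)

def f1_score(prediction, ground_truth):
    """F1: harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```

`exact_match` requires the normalized strings to be identical, while `f1_score` rewards partial token overlap, which is why F1 is always at least as high as EM.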
Model | Dev EM Score | Dev F1 Score
---|---|---
Original Paper | 71.1 | 79.5
My Implementation | 60.1 | 68.9
My Implementation (w/o char emb) | 57.8 | 67.9
You can find the current SQuAD leaderboard online and compare with other models.
As shown above, I have so far failed to reproduce the paper's results. Several technical details concern me:
- Data preprocessing. I have tried two preprocessing approaches: one used in the implementation of Match-LSTM, and the other used in the implementation of Bi-DAF. While the latter includes a lot of reasonable processing, I chose the former empirically, since it yields better performance.
- Dropout has not yet been applied in my implementation. I am currently running experiments on this.
- As pointed out in another implementation of R-NET in Keras:

  > The first formula in (11) of the report contains a strange summand `W_v^Q V_r^Q`. Both tensors are trainable and are not used anywhere else in the network. We have replaced this product with a single trainable vector.

  However, instead of replacing the product with a single trainable vector, I followed the paper's notation and kept both trainable tensors.
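For reference, the question-pooling attention of equation (11), with the extra `W_v^Q V_r^Q` summand kept as in this implementation, can be sketched in NumPy. The shapes and variable names below are illustrative, not the repo's actual code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d, h = 5, 4, 3               # question length, hidden dim, attention dim
uQ = rng.normal(size=(n, d))    # question encoder states u_j^Q

W_u = rng.normal(size=(h, d))   # W_u^Q
W_v = rng.normal(size=(h, d))   # W_v^Q (the "strange" extra matrix)
V_r = rng.normal(size=(d,))     # V_r^Q (trainable pseudo-state)
v = rng.normal(size=(h,))       # attention vector v

# s_j = v^T tanh(W_u^Q u_j^Q + W_v^Q V_r^Q)
s = np.tanh(uQ @ W_u.T + (W_v @ V_r)) @ v
a = softmax(s)                  # attention weights over question words
rQ = a @ uQ                     # pooled question representation r^Q

assert rQ.shape == (d,) and np.isclose(a.sum(), 1.0)
```

Since `W_v^Q` and `V_r^Q` only ever appear as the product `W_v @ V_r`, a fixed bias added to every score, collapsing them into one trainable vector (as the Keras port does) leaves the model's expressiveness unchanged.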
- Variable sharing. The notation in the original paper was very confusing to me. For example, `W_v^P` appears in both equations (4) and (8). In my opinion, they should not be the same variable, since they are multiplied by vectors from totally different spaces. As a result, I treat them as different variables empirically.
- Hyper-parameter ambiguity. Some hyper-parameters weren't specified in the original paper, including the character embedding dimension, the truncation lengths of articles and questions, and the length of the answer span during inference. I set up my own hyper-parameters empirically, mostly following the settings of Match-LSTM and Bi-DAF.
- Any other implementation mistakes and bugs.
The full model could not be trained on an NVIDIA Tesla K40m with 12 GiB of memory; TensorFlow reports a serious OOM problem. There are a few possible solutions.
- Run on CPU. This can be achieved by assigning a device mask on the command line as follows. In fact, the implementation results shown in the previous section were generated by a model trained on CPU. However, this makes training extremely slow: in my experience, roughly 24 hours per epoch.

```shell
CUDA_VISIBLE_DEVICES="" python rnet.py
```
- Reduce hyper-parameters. Modifying these parameters might help:
  - `p_length`: the paragraph truncation length.
  - Word embedding dimension: change from 300d GloVe vectors to 100d.
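A reduced setting might look like the fragment below. Only `p_length` is named elsewhere in this README; the other key names (and all values) are hypothetical, and the actual keys in `Models/config.json` may differ:

```json
{
  "p_length": 300,
  "q_length": 30,
  "emb_dim": 100
}
```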
- Don't use character embeddings. To achieve this, one might have to hack into `Models/models_rnet`. I'll try to make this a parameter in `Models/config.json`, but this feature won't arrive soon. According to Bi-DAF, character embeddings don't help much. However, Bi-DAF uses 1D CNNs to generate its character embeddings, while R-NET uses RNNs. As shown in the previous section, performance dropped by about 2% without them. Further investigation is needed for this part.