This repository contains all the code used to run the experiments in our project report. We tackle the message polarity classification task: given a message, decide whether it expresses negative, neutral or positive sentiment. We developed and validated the CNN from Zhang and Wallace, 2015, and compared the performance of this system with a new method of pre-training language representations, called BERT, which obtains state-of-the-art results on a wide array of NLP tasks. We performed 3 repetitions of 5-fold cross-validation for each experiment.
All our experiments were run on a Linux server with an NVIDIA Tesla K40 GPU, kindly provided to us by Professor Giuseppe Attardi @ UniPi.
The root of the project contains:

- the Python script `run_cnn.py`, which implements the CNN and allows the user to choose between Cross-Validation and Test mode (see Scripts for further details);
- the Python script `run_bertft.py`, the same as `run_cnn.py` but implementing both fine-tuned BERT systems involved in our analysis;
- the Jupyter Notebook `Data_Cleaning.ipynb`, which contains the data pre-processing pipeline and the functions to export the data, labels and embedding matrix needed in the lookup layer of the CNN model (the code in this notebook has to be executed BEFORE the Python scripts);
- the Python script `run_classifier_.py`, an adapted version of the original script in the BERT repo;
- the folder `cv_result`, the output folder for our scripts in CV mode;
- the folder `results_test`, the output folder for our scripts in test mode;
- this README.
The code is written in Python (3.6.8) and requires Keras (2.2.4), TensorFlow (1.13.1) and tweet-preprocessor (1.3.1).
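Assuming a pip-based environment, and that the names above match the packages as published on PyPI, the pinned dependencies can be installed in one go:

```
pip install keras==2.2.4 tensorflow==1.13.1 tweet-preprocessor==1.3.1
```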
Before running the scripts, make sure that your data has been preprocessed as illustrated in `Data_Cleaning.ipynb`. For validating and assessing the risk of the CNN we use `run_cnn.py`, while for BERT we use `run_bertft.py`.
Once you've created your `train_data`, `train_labels` and `embedding_matrix*` files using the data pre-processing pipeline in `Data_Cleaning.ipynb`, and put them in the `data` folder, you can run the script in this way:

```
python run_cnn.py mtest k5,5 n100,100
```

The following table provides additional information on the parameters.
Name | Values | Description | Default value |
---|---|---|---|
b | int > 0 | batch size | 32 |
e | int > 0 | number of epochs | 2 |
k | tuple | comma-separated sequence of ints specifying the kernel sizes | 2,3,4 |
n | tuple | comma-separated sequence of ints specifying the number of filters | 100,100,100 |
x | string | suffix of the matrix built with the data-cleaning pipeline | TW200 |
a | string | activation function | relu |
d | float >= 0 | dropout | 0.0 |
m | cv, test | if cv, the script runs cross-validation; otherwise it runs the tests | cv |
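To make the parameters concrete, here is a minimal sketch of the Zhang and Wallace-style architecture they configure (this is illustrative, not the actual `run_cnn.py` code; `build_cnn`, its signature, and the frozen embeddings are assumptions):

```python
# Illustrative sketch only -- not the actual run_cnn.py code.
# A Zhang & Wallace (2015) style CNN wired from the parameters in the table:
# k -> kernel sizes, n -> number of filters, a -> activation, d -> dropout.
from keras.layers import (Concatenate, Conv1D, Dense, Dropout, Embedding,
                          GlobalMaxPooling1D, Input)
from keras.models import Model

def build_cnn(embedding_matrix, seq_len, kernels=(2, 3, 4),
              filters=(100, 100, 100), activation='relu', dropout=0.0):
    vocab_size, emb_dim = embedding_matrix.shape
    inp = Input(shape=(seq_len,))
    # Lookup layer initialized with the pre-trained embedding matrix
    # exported by Data_Cleaning.ipynb (kept frozen here by assumption).
    emb = Embedding(vocab_size, emb_dim, weights=[embedding_matrix],
                    trainable=False)(inp)
    # One convolution + max-over-time pooling branch per (kernel, filters) pair.
    branches = [GlobalMaxPooling1D()(Conv1D(n, k, activation=activation)(emb))
                for k, n in zip(kernels, filters)]
    merged = Concatenate()(branches)
    out = Dense(3, activation='softmax')(Dropout(dropout)(merged))
    model = Model(inp, out)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['categorical_accuracy'])
    return model
```

Each kernel size in `k` gets its own convolution branch with the matching number of filters from `n`; the branches are max-pooled over time, concatenated, and fed through dropout into a 3-way softmax (negative, neutral, positive).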
Example of the script's output in test mode: verbose mode is on during training, and we print scores, for both the training and test sets, for the single classes (negative, neutral, positive) and averaged.
```
Epoch 1/2
21240/21240 [==============================] - 140s 7ms/step - loss: 0.8765 - categorical_accuracy: 0.5734
Epoch 2/2
21240/21240 [==============================] - 154s 7ms/step - loss: 0.5440 - categorical_accuracy: 0.7769
Scores on training
21240/21240 [==============================] - 16s 765us/step
Accuracy: 0.953436911465309
Mavg_recall: 0.9510301922854709
F1-score: 0.9475169784506192
Class F1 [0.9316843345111896, 0.9516968325791855, 0.9633496223900488]
Start TEST
**************************
TEST: 2013
3547/3547 [==============================] - 2s 678us/step
Accuracy: 0.41217930642255163
Mavg_recall: 0.34760255236915666
F1-score: 0.2617124634916851
Class F1 [0.13157894736842107, 0.4998584772148316, 0.39184597961494905]
```
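For reference, one plausible way to compute the reported metrics (assuming scikit-learn, which the scripts may or may not use internally; the toy labels below are purely illustrative):

```python
# Illustrative only: the reported scores, computed with scikit-learn over
# integer-coded classes {0: negative, 1: neutral, 2: positive}.
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1]   # toy gold labels
y_pred = [0, 1, 2, 1, 1]   # toy predictions

print('Accuracy:', accuracy_score(y_true, y_pred))
print('Mavg_recall:', recall_score(y_true, y_pred, average='macro'))
print('F1-score:', f1_score(y_true, y_pred, average='macro'))
print('Class F1', list(f1_score(y_true, y_pred, average=None)))
```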
To execute the script you must:

- download the BERT repository and the BERT-Base Uncased model from google-research (a possible sequence of commands is sketched below);
- put the modules `run_bertft.py` and `run_classifier_.py` into the BERT directory.
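A minimal setup sketch, assuming git, wget and unzip are available; the checkpoint URL is the standard google-research release of BERT-Base Uncased, and the paths are illustrative:

```
# Clone the BERT repo and fetch the BERT-Base Uncased checkpoint
git clone https://github.com/google-research/bert.git
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip -d bert/
# Copy our modules into the BERT directory
cp run_bertft.py run_classifier_.py bert/
```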
All datasets (train and test) must be .tsv files where the last column is the tweet column and the penultimate column is the label column, as in the example below.
ID | Label | Tweet |
---|---|---|
32 | Positive | It's a beautiful day! |
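A minimal sketch of exporting data in this layout, assuming pandas (the column names and output path are illustrative, matching the example row above and the default train path from the table below):

```python
# Illustrative: write a .tsv with the label in the penultimate column
# and the tweet in the last one, as the scripts expect.
import pandas as pd

df = pd.DataFrame({
    'ID': [32],
    'Label': ['Positive'],
    'Tweet': ["It's a beautiful day!"],
})
# Tab-separated, no index column, so the column order above is preserved.
df.to_csv('./data/BERT_data/train/tweet_train_df.tsv', sep='\t', index=False)
```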
Example of script execution:

```
python run_bertft.py mode=test seq_len=50 epochs=3 reps=2 fold=10
```
The script has some parameters with default values. The table below contains all the parameters you can change.
Name | Values | Description | Default value |
---|---|---|---|
mode | train, test | if train, the script runs cross-validation; otherwise it runs the tests | train |
train | path | path of the train dataset | ./data/BERT_data/train/tweet_train_df.tsv |
test | path | path of the directory that contains the test datasets | ./data/BERT_data/test |
softmax | 0, 1 | if 1, runs BERT fine-tuning with a softmax layer; otherwise, BERT fine-tuning with the CNN | 1 |
batch_size | int > 0 | batch size | 32 |
seq_len | int > 0 | sequence length | 40 |
epochs | int > 0 | number of epochs | 2 |
reps | int > 0 | number of repetitions of cross-validation | 3 |
fold | int > 0 | number of folds for cross-validation | 5 |
 | 0, 1 | if 1, prints the recall and F1 scores for each class | 0 |
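For instance, a cross-validation run with the BERT + softmax setup could be launched as follows (every flag comes from the table above, with its default value made explicit):

```
python run_bertft.py mode=train softmax=1 epochs=2 reps=3 fold=5
```

With reps=3 and fold=5 this performs 15 fold runs in total, which matches the `[ 15 / 15 ]` progress marker in the output below.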
Example of the script's output in CV mode: for each fold we print the validation scores and the scores for each class.
```
....
*******[ 15 / 15 ]*******
....
***** Scores for each class *****
Accuracy: 0.6923258003766478
Recall: [0.65602322 0.62883087 0.76850306]
F1: [0.63261022 0.65291691 0.75197386]
***** Averaged Scores *****
Accuracy: 0.6923258003766478
Recall: 0.6844523855748061 ---- 0.6844523855748061
F1: 0.69229204013095
=========== REPS 3 RESULTS ===========
Accuracy: 0.699482109227872
Recall: 0.6884577812392527
F1: 0.6911314647836555
```
Ye Zhang has written a very nice paper with an extensive analysis of model variants (e.g. filter widths, k-max pooling, word2vec vs. GloVe) and their effect on performance.
Experimenting with test data from the SemEval reruns from 2013 up to 2017, running `run_bertft.py` with option `mode=test` achieves state-of-the-art scores across multiple test sets (e.g. as reported in Rosenthal et al., 2017, the top F1-score in 2017 was 0.685, while our system achieves 0.694).