This code is written in Python. The dependencies are:
- Python 3 (with recent versions of NumPy and SciPy)
- PyTorch (tested on PyTorch 0.3.1)
- scikit-learn
- NLTK >= 3
- pycorenlp
- progressbar2
All models can be trained for 4 different evaluation tasks:
- 'class': 3-class classification (low, medium, high coherence)
- 'score_pred': mean score prediction
- 'perm': binary ranking of original vs. permuted texts (requires text permutation files)
- 'minority': binary classification of low coherence vs. all other texts
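As a rough illustration of how the targets for these tasks relate to each other, here is a minimal sketch. It assumes labels are encoded as 1 (low), 2 (medium), 3 (high) and that 'score_pred' targets the mean of several rater scores; only "label = 3" for high coherence is stated in this README, and the function names are hypothetical, not part of this code base.

```python
from statistics import mean

# Assumed label encoding (only HIGH = 3 is confirmed by this README).
LOW, MEDIUM, HIGH = 1, 2, 3


def minority_target(label):
    """'minority' task: 1 for low-coherence texts, 0 for all other texts."""
    return 1 if label == LOW else 0


def score_pred_target(rater_scores):
    """'score_pred' task: mean of the individual rater scores (assumption)."""
    return mean(rater_scores)


def is_perm_source(label):
    """'perm' task: only high-coherence texts get permuted counterparts."""
    return label == HIGH
```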
The GDCD data is available by request (see for details). To run the preprocessing scripts, you will have to create a directory for each corpus in 'data/' containing the train and test CSV files. For the Yelp data, you will need to download the data separately and add the corresponding review titles and texts to the incomplete CSV file (the CSV header should match the fields in the Clinton and Enron CSVs).
'corpus' refers to the corpus name: {Yahoo, Clinton, Enron, Yelp}
1) Extract texts from CSV to separate files. Required for entity grid and entity graph models, as well as generating text permutations for evaluation.
Input: data/[corpus]/[corpus]_train.csv and data/[corpus]/[corpus]_test.csv files. Output: data/[corpus]/text/ directory containing all individual text files.
python3 [corpus]
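For orientation, the extraction step amounts to roughly the following. This is a minimal sketch, not the actual script: the column names 'text_id' and 'text' and the '.txt' suffix are assumptions, so adjust them to the real CSV header.

```python
import csv
import os


def extract_texts(csv_path, out_dir, id_field="text_id", text_field="text"):
    """Write each CSV row's text to its own file in out_dir.

    id_field and text_field are hypothetical column names.
    Returns the list of file paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            out_path = os.path.join(out_dir, row[id_field] + ".txt")
            with open(out_path, "w", encoding="utf-8") as out:
                out.write(row[text_field])
            written.append(out_path)
    return written
```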
2) Generate permutation text files (20 per text). Only generates permutations for high-coherence texts (label = 3). Required for evaluating any model on the binary permutation ranking task (can skip this step for all other experiments).
Input: data/[corpus]/[corpus]_train.csv, data/[corpus]/[corpus]_test.csv, and data/[corpus]/text/ files. Output: data/[corpus]/text_permute/ directory containing original and permuted text files for all high-coherence texts.
python3 generate_high_coh_permutations [corpus]
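The idea behind this step can be sketched as follows. This is an illustration only: it assumes the text is already split into sentences, that permutations should be distinct and differ from the original order, and that short texts simply yield fewer than 20 permutations. The function name and the seed parameter are hypothetical.

```python
import math
import random


def generate_permutations(sentences, n_perms=20, seed=0):
    """Return up to n_perms distinct sentence orderings, none equal to the original."""
    rng = random.Random(seed)
    identity = tuple(range(len(sentences)))
    # A text with k sentences has k! - 1 orderings other than the original.
    target = min(n_perms, math.factorial(len(sentences)) - 1)
    seen, orders = set(), []
    while len(orders) < target:
        order = list(identity)
        rng.shuffle(order)
        order = tuple(order)
        # Shuffle indices rather than strings so repeated sentences are handled.
        if order != identity and order not in seen:
            seen.add(order)
            orders.append(order)
    return [[sentences[i] for i in order] for order in orders]
```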
3) Extract entity grid files (requires Stanford CoreNLP for parsing). Required for entity grid and entity graph models.
This step requires running the Stanford CoreNLP server (with Java 8, not Java 9). More details here: and here: You will probably need to run the server with -timeout 50000 (or possibly higher) instead of -timeout 15000 in order to process the longest documents in this dataset.
Original files only:
Input: data/[corpus]/text/ files. Output: data/[corpus]/parsed/ and data/[corpus]/grid/ files.
python3 [corpus]
Permuted files:
Input: data/[corpus]/text_permute/ files. Output: data/[corpus]/parsed_permute/ and data/[corpus]/grid_permute/ files.
python3 [corpus]
4) Extract entity graph files from entity grid files. Extracts 6 different types of entity graphs: {unweighted, weighted, and syntax-sensitive}, each with or without distance discounting. Specify 'true' or 'false' for the 'is_permutation' argument. Required for entity graph model.
Input: data/[corpus]/grid[_permute]/ files. Output: data/[corpus]/graph[_permute]/ files.
python3 [corpus] [is_permutation]
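To make the six graph types concrete, here is a minimal sketch of one common formulation of sentence projection graphs and their average outdegree. The grid representation (one dict per sentence mapping entity to role), the role weights s=3, o=2, x=1, and the discount of dividing by sentence distance are assumptions for illustration; the actual scripts may differ.

```python
ROLE_WEIGHT = {"s": 3, "o": 2, "x": 1}  # assumed syntactic role weights


def edge_weight(sent_a, sent_b, base):
    """Weight of the edge between two sentences, given their entity->role dicts."""
    shared = set(sent_a) & set(sent_b)
    if base == "u":    # unweighted: 1 if any entity is shared
        return 1.0 if shared else 0.0
    if base == "w":    # weighted: number of shared entities
        return float(len(shared))
    if base == "syn":  # syntax-sensitive: product of role weights, summed
        return float(sum(ROLE_WEIGHT[sent_a[e]] * ROLE_WEIGHT[sent_b[e]]
                         for e in shared))
    raise ValueError("unknown graph type: " + base)


def average_outdegree(grid, graph_type="u"):
    """Average outdegree for graph_type in {u, u_dist, w, w_dist, syn, syn_dist}."""
    base, _, suffix = graph_type.partition("_")
    discount = suffix == "dist"
    n = len(grid)
    if n == 0:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # edges point from earlier to later sentences
            w = edge_weight(grid[i], grid[j], base)
            total += w / (j - i) if discount else w
    return total / n
```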
5) Extract features from entity grid files. Required for entity grid model. Must specify:
- 'seq_len': the number of sequential sentences over which to compute features (e.g. 2, 3, 4)
- 'salience_threshold': the threshold for salient vs. non-salient entities (e.g. 2, 3, 4 occurrences); specify '1' for only one salience class
- 'syntax_opt': 1 to use syntactic roles (s, o, x, -); 0 to ignore syntactic roles (x, -)
- 'is_permutation': 'true' if using permuted text files, 'false' if using original text files only
Input: data/[corpus]/grid[_permute]/ files. Output: data/[corpus]/features[_permute]/[feature_set]
python3 [corpus] [seq_len] [salience_threshold] [syntax_opt] [is_permutation]
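The way these arguments interact can be sketched as below. This is an illustrative version of entity grid transition features, not the actual script: the grid format (entity mapped to its list of roles per sentence), the feature names, and the per-class normalization are assumptions.

```python
from collections import Counter
from itertools import product


def grid_features(columns, seq_len=2, salience_threshold=1, syntax_opt=1):
    """Transition probabilities over role sequences of length seq_len.

    columns: dict mapping entity -> list of roles ('s', 'o', 'x', '-'),
    one role per sentence (hypothetical in-memory grid format).
    """
    roles = ["s", "o", "x", "-"] if syntax_opt else ["x", "-"]
    counts = {"salient": Counter(), "nonsalient": Counter()}
    totals = Counter()
    for column in columns.values():
        if not syntax_opt:
            # Collapse all syntactic roles into a single 'present' role.
            column = ["-" if r == "-" else "x" for r in column]
        occurrences = sum(r != "-" for r in column)
        group = "salient" if occurrences >= salience_threshold else "nonsalient"
        for i in range(len(column) - seq_len + 1):
            counts[group][tuple(column[i:i + seq_len])] += 1
            totals[group] += 1
    # A threshold of 1 puts every entity in the same (single) salience class.
    groups = ["salient"] if salience_threshold <= 1 else ["salient", "nonsalient"]
    features = {}
    for group in groups:
        for transition in product(roles, repeat=seq_len):
            key = group + ":" + "".join(transition)
            features[key] = (counts[group][transition] / totals[group]
                             if totals[group] else 0.0)
    return features
```

With 'syntax_opt' set to 1 and one salience class, 'seq_len' of 2 gives 16 features; adding a second salience class doubles that.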
Train a random forest classifier on entity grid features. 'feature_set' specifies the name of the feature directory in data/[corpus]/features[_permute]. 'evaluation' specifies the task: 'class', 'score_pred', 'minority', 'perm'.
python3 [corpus] [feature_set] [evaluation]
Use entity graph outdegree values to evaluate on different tasks. Must specify graph type: [u, u_dist, w, w_dist, syn, syn_dist].
Thresholds (any real numbers):
- 'class': must specify 'threshold1' and 'threshold2'
- 'minority': must specify 'threshold1'
- 'perm': no threshold
- 'score_pred': no threshold
python3 [corpus] [evaluation] [graph_type] [opt:threshold1] [opt:threshold2]
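How the thresholds map an outdegree value to a prediction can be sketched as follows. The assumptions here are that a higher outdegree means higher coherence, that threshold1 <= threshold2, that labels are 1/2/3 for low/medium/high, and that the 'minority' task uses 1 for low coherence; the function names are hypothetical.

```python
def predict_from_outdegree(outdegree, evaluation, threshold1=None, threshold2=None):
    """Turn one text's outdegree value into a prediction for the given task."""
    if evaluation == "class":
        if threshold1 is None or threshold2 is None:
            raise ValueError("'class' requires threshold1 and threshold2")
        if outdegree < threshold1:
            return 1  # low coherence
        if outdegree < threshold2:
            return 2  # medium coherence
        return 3      # high coherence
    if evaluation == "minority":
        if threshold1 is None:
            raise ValueError("'minority' requires threshold1")
        return 1 if outdegree < threshold1 else 0
    if evaluation == "score_pred":
        return outdegree  # the raw value serves as the predicted score
    raise ValueError("use ranks_original_higher for 'perm'")


def ranks_original_higher(original_outdegree, permuted_outdegree):
    """'perm' task: correct if the original text scores above its permutation."""
    return original_outdegree > permuted_outdegree
```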
Train a clique model for 3-class classification on Yahoo data with a clique size of 7 sentences:
python3 --model_name yahoo_class_model --train_corpus Yahoo --model_type clique --task class --clique 7
See for other parameters.
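To illustrate what the clique size controls, here is a minimal sketch of splitting a text into overlapping windows of adjacent sentences. It assumes a plain sliding window with no padding, and that a text shorter than the clique size forms a single clique; the actual model may pad or window differently, and the function name is hypothetical.

```python
def sentence_cliques(sentences, clique_size=7):
    """Return all windows of clique_size adjacent sentences (no padding)."""
    if clique_size < 1:
        raise ValueError("clique_size must be at least 1")
    if not sentences:
        return []
    if len(sentences) <= clique_size:
        return [list(sentences)]  # short text: one clique with all sentences
    return [list(sentences[i:i + clique_size])
            for i in range(len(sentences) - clique_size + 1)]
```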
Train a SentAvg model for 3-class classification on Yahoo data:
python3 --model_name yahoo_class_model --train_corpus Yahoo --model_type sent_avg --task class
See for other parameters.
Note: the SentAvg model cannot be trained for the binary permutation ranking task (because all sentence order permutations have the same score).
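The reason is that averaging is order-invariant. A minimal sketch, using plain Python lists as stand-ins for sentence vectors (the function name is hypothetical and this is not the model code):

```python
def sentence_average(sentence_vectors):
    """Element-wise mean of a list of equal-length sentence vectors."""
    n = len(sentence_vectors)
    dim = len(sentence_vectors[0])
    return [sum(vec[d] for vec in sentence_vectors) / n for d in range(dim)]


original = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
permuted = [original[2], original[0], original[1]]

# Both orderings produce the same document representation, so a model
# built on this average has no signal to rank one above the other.
same_representation = sentence_average(original) == sentence_average(permuted)
```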
Train a ParSeq model for 3-class classification on Yahoo data:
python3 --model_name yahoo_class_model --train_corpus Yahoo --model_type par_seq --task class
See for other parameters.
Note: the ParSeq model currently cannot be trained for the binary permutation ranking task.