This repository contains the code for creating the CoherenceGym test suites introduced in our paper
Is Incoherence Surprising? Targeted Evaluation of Coherence Prediction from Language Models
It is based on the SyntaxGym framework for evaluating pre-trained language models on syntactic phenomena and extends this line of work to notions of discourse and dialogue coherence.
Install SyntaxGym as described here (requires Python>=3.7)
The core of our work is the collection and creation of test suites that target discourse and dialogue coherence phenomena. We transform some existing test sets into the test suite format specified by SyntaxGym and extend this set with three new suites. Each test suite contains items in two versions each, where one version is designed to be more coherent than the other. All test suites are based on existing corpora, from which we extract sentences or sentence pairs and introduce perturbations that a model able to encode different notions of coherence should find more surprising than the original version.
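For orientation, a suite roughly follows the SyntaxGym JSON suite layout sketched below. The condition names, region contents, and the prediction formula in this sketch are purely illustrative; the actual suites in this repository define their own.

```json
{
  "meta": {"name": "example_suite", "metric": "sum"},
  "region_meta": {"1": "context", "2": "continuation"},
  "predictions": [
    {"type": "formula", "formula": "(2;%incoherent%) > (2;%coherent%)"}
  ],
  "items": [
    {
      "item_number": 1,
      "conditions": [
        {
          "condition_name": "coherent",
          "regions": [
            {"region_number": 1, "content": "Anna ordered a coffee."},
            {"region_number": 2, "content": "It arrived quickly."}
          ]
        },
        {
          "condition_name": "incoherent",
          "regions": [
            {"region_number": 1, "content": "Anna ordered a coffee."},
            {"region_number": 2, "content": "The printer was out of ink."}
          ]
        }
      ]
    }
  ]
}
```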
We include the commonly used sentence shuffling approach, which breaks coherence in an uncontrolled way.
For discourse data we use the ROCStories corpus. Access can be acquired free of charge here. We use the 2016 test set cloze_test_test__spring2016-cloze_test_ALL_test.tsv
The PersonaChat corpus is used to represent dialogue data and, after installing ParlAI, can be found under ParlAI/data/Persona-Chat/personachat. We use test_both_original.txt.
python de_coherify.py TODO
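As a rough illustration of the shuffling perturbation, a minimal sketch (not the actual de_coherify.py implementation) could look like this:

```python
import random

def make_shuffled_version(sentences, seed=0):
    """Return the sentence sequence in a perturbed order.

    The original order serves as the coherent condition; the shuffled order
    is the incoherent one, which a coherence-aware model should find more
    surprising.
    """
    rng = random.Random(seed)
    shuffled = list(sentences)
    while len(sentences) > 1 and shuffled == list(sentences):
        rng.shuffle(shuffled)  # re-shuffle until the order actually changes
    return shuffled

story = ["Anna packed her bag.", "She drove to the airport.", "Her flight left on time."]
print(story, make_shuffled_version(story), sep="\n")
```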
Another previously proposed task that requires a notion of coherence is the Story Cloze task of discriminating a wrong from a right ending for short (5-sentence) stories.
For the Story Cloze test suite, we use the same ROCStories data as described above.
python de_coherify.py TODO
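The following is a minimal sketch of how one cloze item yields a coherent and an incoherent version, assuming the standard column names of the cloze test tsv file (the actual suite creation is handled by de_coherify.py):

```python
import csv

def cloze_pairs(path):
    """Yield (coherent, incoherent) stories from a Story Cloze tsv file."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            context = " ".join(row[f"InputSentence{i}"] for i in range(1, 5))
            endings = [row["RandomFifthSentenceQuiz1"], row["RandomFifthSentenceQuiz2"]]
            right = int(row["AnswerRightEnding"]) - 1  # the column stores 1 or 2
            yield context + " " + endings[right], context + " " + endings[1 - right]
```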
We re-use the Winograd sentence pairs used by Trinh and Le (2018) and Radford et al. (2019) and transform them into the test suite format.
The original data can be found here under test_suites/original/wsc273.json
python de_coherify.py TODO
We extract sentence pairs from the ARRAU corpus and perturb the second mention of coreferent entities, replacing a pronoun with a repetition of the full NP.
Access to the corpus has to be acquired through the LDC.
python de_coherify.py TODO
The connectives test suite contains manipulations of explicit connectives.
The test suite is based on the Disco-Annotation corpus, which is available here
python de_coherify.py TODO
We construct a dialogue test suite on speaker commitment by combining two segments labeled as contradiction in the DialogueNLI dataset. The suite tests whether models can detect the violation that arises when the same speaker contradicts themselves, as opposed to two different speakers uttering contradicting segments.
The dataset is available at https://wellecks.github.io/dialogue_nli/. We use the human-verified portion of the test set in dialogue_nli_verified_test.jsonl.
python de_coherify.py TODO
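As a rough sketch of this pairing idea (the speaker labels below are hypothetical; the actual formatting is done by de_coherify.py):

```python
def speaker_commitment_pair(premise, hypothesis):
    """Build two short dialogues from a contradiction-labelled DialogueNLI pair.

    In the incoherent version the same speaker utters both contradicting
    segments; in the coherent version they are attributed to different
    speakers, for whom disagreement is unremarkable.
    """
    coherent = f"A: {premise} B: {hypothesis}"    # two speakers may disagree
    incoherent = f"A: {premise} A: {hypothesis}"  # one speaker contradicts themselves
    return coherent, incoherent

print(speaker_commitment_pair("i have two cats .", "i do not have any pets ."))
```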
Models can be evaluated on the test suites using the SyntaxGym pipeline as follows:
syntaxgym run gpt2 /path/to/test_suite > gpt2_test-suite.results
syntaxgym run DialoGPT-medium /path/to/test_suite > dialogpt_test-suite.results
where the .results files will contain a per-item evaluation of whether the prediction specified in the test suite was met.
Currently, only the evaluation of GPT-2 is supported in the official installation of syntaxgym. The pull request for DialoGPT is pending, and we are still trying to fix some compatibility issues with other models in lm-zoo, on which SyntaxGym is based. Information on how to include DialoGPT locally can be found in models/Readme.md.
To evaluate several models on all (or selected) test suites, run
python eval.py {run,show,visualize}
The eval script acts in three different modes:
- run:
python eval.py run [-h] [--config CONFIG] output_directory
This runs the models on the suites and saves the results to csv files.
- show:
python eval.py show [-h] [--config CONFIG] {models,suites} [suite-type]
This shows the available models and suites. When showing suites, it lists the directories containing the types of suites (see below for how to set up suites in directories); set suite-type to show the individual suites of that type. Showing models will list all models in lm_zoo, including ones not made specifically for the dialogue suites. You can still evaluate those models on the suites, but the results will be of questionable value.
- visualize:
python eval.py visualize [-h] [--config CONFIG] [--outdir OUTDIR] inpath
This creates js-tags to include in a website to visualize the results on the test suites. inpath gives the path to the output of run. If no outdir is given, the script will show the visualizations on your device; what happens on a system without a GUI, such as a server, has not been tested.
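For example, assuming the configuration file described below, a typical workflow could look like this (the directory names are illustrative):
python eval.py show models
python eval.py run --config config.ini results
python eval.py visualize --config config.ini --outdir www results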
The script is governed by a config.ini, which contains three items:
- suitespath ... the path to the available test suites. Suites must be organized in a directory that contains subdirectories for each type of suite you want to use; the subdirectories then each contain suites as json-files.
- suitenames ... a list of suites to use, with one entry per line. Each entry can be either a type of suite (i.e. one of the subdirectories), the name of a particular suite, or a path (relative to suitespath) to a suite file. If no suites are given, the script will evaluate on all suites in suitespath.
- models ... a list of models to use, with one entry per line. Each entry must be the name of a model in lm_zoo, to be used with syntaxgym.
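Such a config.ini could look roughly like the sketch below; the section header, paths, and suite names are illustrative, not values shipped with the repository (the model names are the ones used in the commands above):

```ini
[eval]
suitespath = test_suites
suitenames =
    dialogue
    rocstories_shuffle.json
models =
    gpt2
    DialoGPT-medium
```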
We further plan to add more models to lm-zoo in order to evaluate the impact of different model sizes and architectures.
@inproceedings{beyer-etal-2021-incoherence,
title = "Is Incoherence Surprising? Targeted Evaluation of Coherence Prediction from Language Models",
author = "Beyer, Anne and Lo{\'a}iciga, Sharid and Schlangen, David",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.naacl-main.328",
pages = "4164--4173"
}