This repository contains the data and scripts for the paper ContraCAT: Contrastive Coreference Analytical Templates for Machine Translation.
If you just want the data for your own models and tests, you can find it under data
.
Go to scripts
to reproduce or tweak or data generation and evaluation method.
First install the requirements:
python3 -m venv YOUR_ENVIRONMENT_NAME
pip install -r requirements.txt
If you want to reproduce or modify our data generation, you also need 3 external resources:
- ContraPro
- GermaNet
- some form of mapping from German noun to gender
(Details for each are further down)
The data and scripts directory each contain two subfolders: adversarial_ContraPro and templates.
If you quickly want to look at commands to reproduce our data generation, go to commands.sh
.
The data for each adversarial attack consists of a file for English (en.txt
) and German (de.txt
).
Attacks of the types possessive-extension and synonym-replacement additionally contain a modified version of the original contrapro.json
, since only a subset of ContraPro was modified for these attacks. This means we only want to evaluate on that subset.
The following three example commands show how to create the data for the three types of attacks.
python3 -m scripts.adversarial_attack.phrase_addition -n true_separate -d "es ist wahr" -e "it is true" -a --end_punct .
python3 -m scripts.adversarial_attack.possessive_extension -n david -d "Davids" -e "David's" --de_prepend --en_prepend`
python3 -m scripts.adversarial_attack.synonym_replacement
The data for our ContraCAT templaes is under data/templates
and is further semantically divided into the different steps of coreference.
On the lowest level, each concrete template consists of en.txt
and de.txt
file, as well as a groundtruth file called gender.txt
.
To generate the template, you call scripts/generate_template.py
, e.g.:
python3 -m scripts.templates.generate -n 0_priors/verb -e "Wow! <PRO_NOM> <TRANS_VERB> it." -d "Wow! <PRO_NOM> <TRANS_VERB> <PRO_ACC_3_SIN>."
(for more details see commands.sh
)
To evaluate the attacks or templates, you need to have a scores-file such as the provided example_scores
.
Each line in that file corresponds to a line in the above mentioned en.txt
and de.txt
.
These scores are then evaluated with two scripts.
For adversarial attacks, we call the original ContraPro evaluation, e.g.:
python3 -m ContraPro.evaluate --reference PATH_TO_contrapro.json --scores scripts/example_scores
For possessive extension and synonym replacement the reference json-file has to be a subset,e.g.:
python3 -m ContraPro.evaluate --reference data/adversarial_ContraPro/synonyms/modified_contrapro_subset.json --scores scripts/example_scores
For the evaluation of templates, simply call our script scripts/templates/evaluate.py
.
- ContraPro: Simply clone the ContraPro repo into the project directory.
- GermaNet: this was needed to get a mapping from an English to a German synset (find access here). You will have to change the path in
config.py
if you download it. - noun-to-gender-mapping: we cannot provide this mapping and you will have to implement
get_gender_dict()
inutils.py
on your own.
Please cite the following paper:
@inproceedings{stojanovski-etal-2020-contracat,
title = "{C}ontra{CAT}: Contrastive Coreference Analytical Templates for Machine Translation",
author = "Stojanovski, Dario and
Krojer, Benno and
Peskov, Denis and
Fraser, Alexander",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.coling-main.417",
pages = "4732--4749",
}