This is the official repository for our paper [TRoTR: A Framework for Evaluating the Re-contextualization of Text Reuse](https://aclanthology.org/2024.emnlp-main.774).
Below, you will find instructions to reproduce our study. Feel free to contact us!
Computational approaches for detecting text reuse do not focus on capturing the change between the original context of the reused text and its re-contextualization. In this paper, we rely on the notion of topic relatedness and propose a framework called Topic Relatedness of Text Reuse (TRoTR) for evaluating the diachronic change of context in which text is reused. TRoTR includes two NLP tasks: Text Reuse in-Context (TRiC) and Topic variation Ranking across Corpus (TRaC). TRiC is designed to evaluate the topic relatedness between a pair of re-contextualizations. TRaC is designed to evaluate the overall topic variation within a set of re-contextualizations. We also provide a curated TRoTR benchmark of biblical text reuse, human-annotated with topic relatedness. The benchmark exhibits an inter-annotator agreement of .811, computed as the average pairwise correlation over the assigned judgments. Finally, we evaluate multiple established Sentence-BERT models on the TRoTR tasks and find that they exhibit greater sensitivity to semantic similarity than to topic relatedness. Our experiments show that fine-tuning these models can mitigate this sensitivity.
Ensure you have met the following requirements:
- Python 3.10.4
- Required Python packages (listed in `requirements.txt`)
To install the required packages, you can use pip:
```bash
pip install -r requirements.txt
```
The `TRoTR` folder contains the data and labels of our benchmark. `TRoTR/raw_data.jsonl` contains the tweets we manually collected from Twitter (now X).
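If you want to inspect the raw data, the JSONL file can be read line by line with the standard library. The minimal sketch below makes no assumptions about the field names; it simply prints the keys and content of the first few records.

```python
import json

# Minimal sketch: inspect the first few records of the raw JSONL file.
# No field names are assumed; we simply print what is there.
with open("TRoTR/raw_data.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(sorted(record.keys()))  # available fields
        print(record)                 # full record
        if i == 2:                    # stop after three records
            break
```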
We used the running version of the PhiTag annotation platform to display the guidelines, tutorial and data to annotators. Thus, our data adheres to the current format supported by PhiTag.
`TRoTR/tutorial` contains the data used for training annotators in a 30-minute online session, along with their resulting judgments. Note that the tutorial data were excluded from our study.
`TRoTR/guidelines.md` contains the instructions followed by human annotators.
To convert the `TRoTR/raw_data.jsonl` dataset into PhiTag format and randomly sample context pairs for each target quotation, run:
```bash
python src/random-sample.py
```
This creates the `TRoTR/data` folder, which contains a sub-folder for each quotation (e.g., `TRoTR/data/(1 Corinthians 13 4)`). These sub-folders contain the data in PhiTag format. For the sake of simplicity, we also created two files that contain all the context usages (i.e., PhiTag uses) and context pairs (i.e., PhiTag instances) used in our work:

- `TRoTR/data/uses.csv`
- `TRoTR/data/instances.csv`
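For a quick overview of the two merged files, they can be loaded with pandas. This is only a sketch: the tab separator is an assumption (the files may be plain comma-separated), so check the actual header of your copy.

```python
import pandas as pd

# Sketch: load the merged PhiTag-style files.
# The tab separator is an assumption; switch to a comma if the files are plain CSV.
uses = pd.read_csv("TRoTR/data/uses.csv", sep="\t")
instances = pd.read_csv("TRoTR/data/instances.csv", sep="\t")

print(uses.columns.tolist())       # inspect the actual column names
print(instances.columns.tolist())
print(len(uses), "uses |", len(instances), "instances")
```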
We divided the annotation process into four distinct rounds, each covering a different set of targets. This division was implemented manually and only for the purpose of conducting a quality check between consecutive rounds during the annotation process. The PhiTag uses and instances for each round, and the corresponding judgments, can be found in the `TRoTR/rounds` and `TRoTR/judgments` folders, respectively.
After round 1, annotators were evaluated in a 30-minute online session on a subset of instances. We did not include these data in our benchmark; however, they are available in `TRoTR/round/quality-check-1st-round.tsv` and `TRoTR/judgments/quality-check-1st-round.tsv`.
To join the uses and judgments files of the different rounds into two comprehensive files, we used the following command:
```bash
python src/merge-round.py
```
This produces two comprehensive files:

- `TRoTR/round/TRoTR.tsv`
- `TRoTR/judgments/TRoTR.tsv`
We computed inter-annotator agreement using the `stats+DURel.ipynb` notebook.
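The agreement reported above (.811) is the average pairwise correlation over the assigned judgments. The sketch below shows one way such a score can be computed with Spearman correlation; it is not the notebook's code, and the column names (`annotator`, `instanceID`, `judgment`) are assumptions about the judgment file layout.

```python
from itertools import combinations
import pandas as pd
from scipy.stats import spearmanr

# Sketch: average pairwise Spearman correlation between annotators.
# Column names are assumptions; adapt them to the actual TSV header.
df = pd.read_csv("TRoTR/judgments/TRoTR.tsv", sep="\t")

# One row per annotated instance, one column per annotator.
pivot = df.pivot_table(index="instanceID", columns="annotator", values="judgment")

scores = []
for a, b in combinations(pivot.columns, 2):
    pair = pivot[[a, b]].dropna()  # instances judged by both annotators
    if len(pair) > 1:
        rho, _ = spearmanr(pair[a], pair[b])
        scores.append(rho)

print("average pairwise correlation:", sum(scores) / len(scores))
```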
Our TRiC evaluation was conducted on 10 different Train-Test-Dev partitions. The same 10 partitions can be obtained with the following commands:
```bash
python src/cross_validation.py -s binary --n_folds 10
python src/cross_validation.py -s ranking --n_folds 10
```
These commands create 10 different sub-folders under the `TRoTR/datasets/` folder. In particular, each sub-folder contains the data in two formats:
- line-by-line: the two contexts of a pair are represented one below the other, on separate lines.
- pair-by-line: the two contexts of a pair are represented on the same line.

We used the pair-by-line format, but we also make the same data available in the line-by-line format. Both format sub-folders contain the data for the Train-Test-Dev splits.
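To illustrate the difference between the two formats, the sketch below pairs consecutive rows of a hypothetical line-by-line file into single pair-by-line rows. The file names, and the assumption that the two contexts of a pair sit on consecutive rows, may not match the actual dataset layout.

```python
import pandas as pd

# Illustrative sketch only: turn a line-by-line file (one context per row)
# into a pair-by-line file (both contexts on one row).
# File names and the "consecutive rows form a pair" assumption are hypothetical.
lbl = pd.read_csv("line-by-line/train.tsv", sep="\t")

first = lbl.iloc[0::2].reset_index(drop=True)   # first context of each pair
second = lbl.iloc[1::2].reset_index(drop=True)  # second context of each pair

pbl = first.join(second, lsuffix="_1", rsuffix="_2")
pbl.to_csv("pair-by-line-train.tsv", sep="\t", index=False)
```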
Our TRaC evaluation was conducted on the full set of data. To generate ground truth data, we used the following command:
```bash
python src/topic_variation_scores.py
```
To generate benchmark data for TRaC, we used the following command:
```bash
python src/TRaC_dataset_generation.py
```
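For intuition, a topic-variation score for a quotation can be derived from the pairwise judgments of its context pairs. The sketch below is one plausible proxy and not necessarily the definition implemented in `src/topic_variation_scores.py`; the column names (`lemma` for the target quotation, `judgment`) are assumptions.

```python
import pandas as pd

# Sketch of one plausible proxy (not necessarily the paper's definition):
# rank quotations by the mean topic-relatedness judgment of their context
# pairs; a lower mean suggests a higher topic variation.
# Column names ("lemma" for the quotation, "judgment") are assumptions.
df = pd.read_csv("TRoTR/judgments/TRoTR.tsv", sep="\t")

mean_relatedness = df.groupby("lemma")["judgment"].mean()
print(mean_relatedness.sort_values().head())  # most varied quotations first
```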
To fine-tune the models used in our study, you can use the dedicated bash script:
```bash
bash finetuning.sh
```
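For reference, the sketch below shows what fine-tuning a Sentence-BERT bi-encoder on scored context pairs typically looks like with the `sentence-transformers` library. It is not the training code wrapped by `finetuning.sh`; the model name, example pairs, and hyper-parameters are placeholders.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Sketch (not the paper's finetuning.sh): fine-tune a bi-encoder so that the
# cosine similarity of the two contexts matches the gold relatedness score.
# The model name, pair list, and hyper-parameters are placeholders.
model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = [
    # InputExample(texts=[context_1, context_2], label=gold_score_in_[0, 1])
    InputExample(texts=["first context ...", "second context ..."], label=0.8),
    InputExample(texts=["another context ...", "yet another ..."], label=0.2),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
model.save("finetuned-bi-encoder")
```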
For the evaluation, we used three scripts:

- `src/TRiC-sBERT-BiEncoder.py`: for testing Bi-Encoder models on TRiC
- `src/TRiC-sBERT-CrossEncoder.py`: for testing Cross-Encoder models on TRiC
- `src/TRaC-sBERT-BiEncoder.py`: for testing Bi-Encoder models on TRaC
In particular, you can use the following bash command to run the three scripts above and evaluate different models on both TRiC and TRaC:
```bash
bash sequence-model-evaluation.sh
```
This will create the `TRiC-stats.tsv` and `TRaC-stats.tsv` files, containing performance for different metrics and partitions.
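For reference, a minimal bi-encoder evaluation follows the same pattern on both tasks: encode the two contexts of each pair, take the cosine similarity, and correlate it with the gold scores. The sketch below is not the repository's evaluation code; the file path and column names are assumptions.

```python
import pandas as pd
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

# Sketch (not the repository's evaluation script): score context pairs with a
# bi-encoder and correlate the cosine similarities with the gold judgments.
# The file path and column names ("context_1", "context_2", "label") are assumptions.
model = SentenceTransformer("all-MiniLM-L6-v2")
test = pd.read_csv("TRoTR/datasets/0/pair-by-line/test.tsv", sep="\t")

emb1 = model.encode(test["context_1"].tolist(), convert_to_tensor=True)
emb2 = model.encode(test["context_2"].tolist(), convert_to_tensor=True)
similarities = util.cos_sim(emb1, emb2).diagonal().cpu().numpy()

rho, _ = spearmanr(similarities, test["label"])
print("Spearman correlation with gold scores:", rho)
```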
If you use TRoTR, please cite our paper:

```bibtex
@inproceedings{periti2024trotr,
  title = {{TRoTR: A Framework for Evaluating the Re-contextualization of Text Reuse}},
  author = "Periti, Francesco and Cassotti, Pierluigi and Montanelli, Stefano and Tahmasebi, Nina and Schlechtweg, Dominik",
  editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
  booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2024",
  address = "Miami, Florida, USA",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2024.emnlp-main.774",
  pages = "13972--13990",
}
```