This repository contains the data, code, and paper for the Politicians Are Also People project by David Peter Süle, Mie Jonasson, and Nicklas Koch Rasmussen, originally made for the Introduction to Natural Language Processing — Second Year Project course at the IT University of Copenhagen.
What is the performance impact of clustering domain-specific named entity types in cross-domain relation-classification setups and what benchmark can be established for future research?
Relation Extraction is an evolving field within natural language processing. As its last step, Relation Classification (RC) aims to identify the relation type to which two semantically related named entities belong. Cross-domain setups are especially challenging, even more so when domain-specific entity types are used. Research is scarce in the area and mostly focuses on using generic entity types or simply fine-tuning the model on a single target domain. This might still offer challenges when annotated data is not accessible for fine-tuning.
In this paper we explore ways of clustering domain-specific named entity types to reduce cross-domain complexity and improve performance on previously unseen domains. We propose five different methods of grouping entity types and evaluate them in multi-domain and out-of-domain scenarios using our two new benchmarks. In conclusion, we find that all our entity mapping methods outperform the baseline in the out-of-domain setting, with the best performing model improving on the baseline by 8.6 percentage points in weighted F1.
Our work relied heavily on the CrossRE project by Elisa Bassignana and Barbara Plank, described in "CrossRE: A Cross-Domain Dataset for Relation Extraction" (Bassignana & Plank, Findings 2022), and on their repository.
pip install -r requirements.txt
./run.sh
- data
- crossre_data
- The training, development, and test data as provided by the CrossRE project.
- predictions
- Folders are named DOMAIN-LIST_SEED, where DOMAIN-LIST is formed from the first letters of the domains used during training; each folder contains the predictions produced by running the main script.
- ood_clustering_data: data for training with OOD clustering method.
- results: Aggregated results
- crossre_data
- figures: images / plots used for the report.
- src
- Scripts used for training. These are mainly supplied by the CrossRE project, with slight modifications.
- util
- Helper functions to check validity of results.
(Note: 'ood validation' stands for OOD evaluation and 'all' stands for the multi-domain results in the file names.)
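The DOMAIN-LIST_SEED naming of the prediction folders can be decoded with a few lines of Python. The following is only a minimal sketch: the function name `parse_prediction_folder`, the example folder name `amp_4012`, and the assumption that the domain list is a plain run of first letters of the six CrossRE domains (ai, literature, music, news, politics, science) separated from the seed by the last underscore are illustrative and not taken from the repository's scripts.

```python
from typing import List, Tuple

# The six CrossRE domains. Mapping each to its first letter is an assumption
# about how the DOMAIN-LIST part of a folder name is abbreviated.
CROSSRE_DOMAINS = ["ai", "literature", "music", "news", "politics", "science"]
LETTER_TO_DOMAIN = {domain[0]: domain for domain in CROSSRE_DOMAINS}


def parse_prediction_folder(name: str) -> Tuple[List[str], str]:
    """Split a DOMAIN-LIST_SEED folder name into domain names and the seed.

    Hypothetical helper: the seed is kept as a string because its exact
    format is not specified here.
    """
    # Everything before the last underscore is treated as the domain list.
    domain_letters, seed = name.rsplit("_", 1)
    # Non-letter characters (e.g. possible separators) are skipped.
    domains = [
        LETTER_TO_DOMAIN[ch] for ch in domain_letters.lower() if ch.isalpha()
    ]
    return domains, seed


if __name__ == "__main__":
    # "amp_4012" is a made-up folder name used only for illustration.
    print(parse_prediction_folder("amp_4012"))
```

If the actual folders use a different separator or letter order, only the splitting step should need adjusting.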
@misc{politicians-are-people,
title = "Politicians Are Also People: Mapping Is All You Need",
author = "S{\"u}le, David Peter and Jonasson, Mie and Rasmussen, Nicklas Koch",
howpublished = "\url{https://github.com/davidsule/politicians_are_also_people}",
year = "2023",
school = "IT University of Copenhagen",
address = "Copenhagen, Denmark",
note = "Introduction to Natural Language Processing — Second Year Project course report",
abstract = "Relation Extraction is an evolving field within natural language processing. As its last step, Relation Classification (RC) aims to identify the relation type to which two semantically related named entities belong. Cross-domain setups are especially challenging, even more so when domain-specific entity types are used. Research is scarce in the area and mostly focuses on using generic entity types or simply fine-tuning the model on a single target domain. This might still offer challenges when annotated data is not accessible for fine-tuning. In this paper we explore ways of clustering domain-specific named entity types to reduce cross-domain complexity and improve performance on previously unseen domains. We propose five different methods of grouping entity types and evaluate them in multi-domain and out-of-domain scenarios using our two new benchmarks. In conclusion, we find that all our entity mapping methods outperform the baseline in the out-of-domain setting, with the best performing model improving on the baseline by 8.6 percentage points in weighted F1."
}