The KILT benchmark is described in the following paper:
@article{petroni2020kilt,
  title   = {{KILT}: a Benchmark for Knowledge Intensive Language Tasks},
  author  = {Fabio Petroni and Aleksandra Piktus and Angela Fan and Patrick Lewis and Majid Yazdani and Nicola De Cao and James Thorne and Yacine Jernite and Vassilis Plachouras and Tim Rockt{\"{a}}schel and Sebastian Riedel},
  journal = {arXiv preprint arXiv:2009.02252},
  year    = {2020}
}
https://arxiv.org/abs/2009.02252
conda create -n kilt37 -y python=3.7 && conda activate kilt37
pip install -r requirements.txt
The KILT knowledge source can be downloaded here: kilt_knowledgesource.json (34.76GiB).
It is based on the 2019/08/01 Wikipedia dump.
We use MongoDB to index the knowledge source (but any JSON-based DB would work).
To import the knowledge source into MongoDB, run:
wget http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json
mongoimport --db kilt --collection knowledgesource --file kilt_knowledgesource.json
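To sanity-check the import, you can count the indexed documents with pymongo. A minimal sketch, assuming MongoDB runs locally on the default port and the db/collection names used above:

from pymongo import MongoClient

# connect to the local MongoDB instance and select the KILT collection
client = MongoClient("localhost", 27017)
collection = client["kilt"]["knowledgesource"]

# a complete import should yield 5,903,530 documents
print(collection.count_documents({}))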
{
 'wikipedia_title': 'Email marketing',
 'wikipedia_id': 1101759,
 'text': ['p1', 'p2', ..., 'pn'],  # list of paragraph texts
 'anchors': [{'text':, 'href':, 'paragraph_id':, 'start':, 'end':}, ...],
 'categories': 'comma-separated list of categories',
 'history': # some info from Wikipedia, including the original url
 'wikidata_info': # Wikidata info
}
from kilt.knowledge_source import KnowledgeSource

# get the knowledge source
ks = KnowledgeSource()

# count entries - 5903530
ks.get_num_pages()

# get a page by Wikipedia id
page = ks.get_page_by_id(27097632)

# get a page by Wikipedia title
page = ks.get_page_by_title("Michael Jordan")
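Each returned page follows the record structure documented above, so its fields can be read directly. A minimal sketch continuing the snippet above (field names come from that structure):

# print basic metadata and the first paragraph of the page
print(page["wikipedia_title"], page["wikipedia_id"])
print(page["text"][0])

# anchors point back into the paragraphs via paragraph_id/start/end
for anchor in page["anchors"][:3]:
    print(anchor["paragraph_id"], anchor["text"], "->", anchor["href"])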
mkdir data
python scripts/download_all_kilt_data.py
python scripts/get_triviaqa_input.py
You can also download and use the KILT data through HuggingFace's nlp library (since renamed to datasets).
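For example, with the datasets library (a minimal sketch; the kilt_tasks dataset name and the nq configuration are assumptions about the Hub naming, not something this repository guarantees):

from datasets import load_dataset

# load the Natural Questions slice of KILT from the HuggingFace Hub
kilt_nq = load_dataset("kilt_tasks", "nq")
print(kilt_nq["train"][0]["input"])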
Note that we release only the input for the test sets, without answers. Test answers are used for the KILT challenge on EvalAI where participants can upload their models’ predictions and be listed on the public leaderboard (there are strict submission limits to discourage overfitting on test data).
{'id': # original data point id if available, otherwise a unique id
 'input': # question / claim / sentence / etc.
 'output': [ # each element might contain an answer, a provenance, or both
    {
     'answer': # answer in textual form
     'provenance': [
        # evidence set for the answer from the KILT knowledge source
        {
         'wikipedia_id': # *mandatory*
         'title':
         'section':
         'start_paragraph_id':
         'start_character':
         'end_paragraph_id':
         'end_character':
         'bleu_score': # wrt the original evidence
         'meta': # dataset/task specific
        }
     ]
    }
 ]
 'meta': # dataset/task specific
}
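To consume this format, iterate over the jsonl lines and unpack answers and provenance. A minimal sketch (the data/nq-dev-kilt.jsonl filename is an assumption about what the download script produces):

import json

# each KILT file is jsonl: one data point per line
with open("data/nq-dev-kilt.jsonl") as f:
    for line in f:
        datapoint = json.loads(line)
        question = datapoint["input"]
        for output in datapoint["output"]:
            answer = output.get("answer")
            # Wikipedia ids of the evidence pages, when provided
            pages = [p["wikipedia_id"] for p in output.get("provenance", [])]
            print(question, answer, pages)
        break  # inspect only the first data point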
* Run python scripts/get_triviaqa_input.py to get the question associated with each id.
For Entity Linking, in addition to the AIDA CoNLL-YAGO train set, the whole knowledge source can be used as training data by exploiting hyperlinks. To facilitate experimentation, we release such data in KILT format following the splits of BLINK:
- blink-train-kilt.jsonl (9M lines)
- blink-dev-kilt.jsonl (10,000 lines)
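These files follow the same KILT format described above. A quick way to peek at the first record (a minimal sketch; it assumes the file was downloaded to the current directory):

import json

# read the first entity-linking data point
with open("blink-dev-kilt.jsonl") as f:
    datapoint = json.loads(next(f))

# the input carries the mention in context; the output points to the gold entity page
print(datapoint["input"])
print(datapoint["output"])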
We also provide a script to map the TAC-KBP 2010 dataset to the knowledge source and format of KILT; please follow the dedicated README.
If the module cannot be found, preface the python command with PYTHONPATH=. (for example, PYTHONPATH=. python scripts/get_triviaqa_input.py).
If the experiments fail on GPU memory allocation, try reducing the batch size.
KILT is MIT licensed. See the LICENSE file for details.