The repository contains code and data for the ECIR 2020 paper "Joint Word and Entity Embeddings for Entity Retrieval from a Knowledge Graph" [pdf, slides, presentation].
KEWER embeddings trained on the categories, literals, and predicates structural components, as well as unigram probabilities, are available here: https://academictorrents.com/details/4778f904ca10f059eaaf27bdd61f7f7fc93abc6e.
KEWER significantly improves entity retrieval for complex queries. Below are the top 10 results for the query "wonders of the ancient world" obtained using BM25F and KEWER. Relevant results are italicized, and highly relevant results are boldfaced.
BM25F | KEWER |
---|---|
Seven Wonders of the Ancient World | Colossus of Rhodes |
7 Wonders of the Ancient World (video game) | Statue of Zeus at Olympia |
Wonders of the World | Temple of Artemis |
Seven Ancient Wonders | List of archaeoastronomical sites by country |
The Seven Fabulous Wonders | Hanging Gardens of Babylon |
The Seven Wonders of the World (album) | Antikythera mechanism |
Times of India's list of seven wonders of India | Timeline of ancient history |
Lighthouse of Alexandria | Wonders of the World |
7 Wonders (board game) | Lighthouse of Alexandria |
Colossus of Rhodes | Great Pyramid of Giza |
To download the dataset, which is a subset of English DBpedia 2015-10, run the make-dataset.sh script.
Verify that it produced the following files and directories in the dbpedia-2015-10-kewer directory:
$ tree --dirsfirst dbpedia-2015-10-kewer
dbpedia-2015-10-kewer
├── graph
│ ├── infobox_properties_en.ttl
│ ├── mappingbased_literals_en.ttl
│ └── mappingbased_objects_en.ttl
├── labels
│ ├── anchor_text_en.ttl
│ ├── category_labels_en.ttl
│ ├── dbpedia_2015-10.nt
│ ├── infobox_property_definitions_en.ttl
│ └── labels_en.ttl
├── article_categories_en.ttl
├── short_abstracts_en.ttl
└── transitive_redirects_en.ttl
2 directories, 11 files
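The verification step can also be scripted. The sketch below is not part of the repository; it only checks that the 11 files from the tree listing exist, and the helper name `missing_files` is hypothetical.

```python
from pathlib import Path

# Relative paths copied from the tree listing above.
EXPECTED_FILES = [
    "graph/infobox_properties_en.ttl",
    "graph/mappingbased_literals_en.ttl",
    "graph/mappingbased_objects_en.ttl",
    "labels/anchor_text_en.ttl",
    "labels/category_labels_en.ttl",
    "labels/dbpedia_2015-10.nt",
    "labels/infobox_property_definitions_en.ttl",
    "labels/labels_en.ttl",
    "article_categories_en.ttl",
    "short_abstracts_en.ttl",
    "transitive_redirects_en.ttl",
]


def missing_files(root):
    """Return the expected files that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files("dbpedia-2015-10-kewer")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All 11 expected files are present.")
```

Run it from the directory that contains dbpedia-2015-10-kewer; it checks only that the files exist, not their contents.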
- Generate the indexed file with the filtered entities: make-indexed.sh.
- Install required packages:
$ conda create --name kewer --file requirements.txt
$ conda activate kewer
- Train embeddings:
$ cd embeddings/KEWER
$ ./gen_graph.py
$ ./gen_walks.py --cat --outfile data/walks-cat.txt
$ ./replace_uris.py --pred --lit --infile data/walks-cat.txt --outfile data/sents-cat-pred-lit.txt
# optional - shuffle sentences: $ shuf data/sents-cat-pred-lit.txt -o data/sents-cat-pred-lit.txt
$ ./train_w2v.py --infile data/sents-cat-pred-lit.txt --outfiles data/kewer
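To give a rough intuition for the walk-generation and URI-replacement steps, here is a minimal sketch on a toy graph. It is not the logic of gen_walks.py or replace_uris.py: the graph, the labels, and the function names are all hypothetical, and the real scripts may sample and replace differently.

```python
import random

# Hypothetical toy graph: subject -> list of (predicate, object) pairs.
TOY_GRAPH = {
    "dbr:Colossus_of_Rhodes": [("dbo:location", "dbr:Rhodes")],
    "dbr:Rhodes": [("dbo:country", "dbr:Greece")],
    "dbr:Greece": [],
}

# Hypothetical labels used to turn predicate URIs into words.
TOY_LABELS = {
    "dbo:location": "location",
    "dbo:country": "country",
}


def random_walk(graph, start, max_hops, rng):
    """Walk up to `max_hops` edges from `start`, recording predicates and nodes."""
    walk = [start]
    node = start
    for _ in range(max_hops):
        edges = graph.get(node, [])
        if not edges:
            break  # dead end: no outgoing edges
        predicate, node = rng.choice(edges)
        walk.extend([predicate, node])
    return walk


def replace_uris(walk, labels):
    """Replace every URI that has a label with the label's lowercase tokens."""
    tokens = []
    for item in walk:
        if item in labels:
            tokens.extend(labels[item].lower().split())
        else:
            tokens.append(item)
    return tokens


rng = random.Random(0)
walk = random_walk(TOY_GRAPH, "dbr:Colossus_of_Rhodes", max_hops=2, rng=rng)
sentence = replace_uris(walk, TOY_LABELS)
print(" ".join(sentence))
# prints: dbr:Colossus_of_Rhodes location dbr:Rhodes country dbr:Greece
```

Sentences of this kind, which mix entity URIs with ordinary words, are what a word2vec-style trainer would consume so that words and entities end up in the same vector space.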
bm25f/
scripts to optimize and run retrieval with the BM25F baseline using the Galago fork https://sourceforge.net/projects/galago-fork/. You need to provide an index in the index/ directory to run the scripts. To convert .ttl files into the trecweb format that Galago can index, you can use https://github.com/teanalab/dbpedia2fields.
embeddings/
scripts to train KEWER and the Jointly baseline.
entity-extraction/
scripts to perform entity linking in queries using DBpedia Spotlight, Nordlys LTR, and SMAPH.
interpolation-el/
interpolation BM25F+KEWER_el-SM.
interpolation/
interpolation BM25F+KEWER.
qrels/
relevance judgments from DBpedia-Entity v2.
queries/
query folds in json and tsv formats.
retrieval/
ranking of entities using embeddings only, without interpolation, as in Table 2 of the paper.
word2vec/
scripts for BM25F+word2vec baseline.
eval.sh
evaluate result runs using the provided qrels_file.
make-dataset.sh
download DBpedia 2015-10 dataset.
make-indexed.sh
generate 'indexed' file with the filtered entities.
queries-v2_stopped.txt
DBpedia-Entity v2 queries.
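The retrieval/ and interpolation/ directories listed above rank entities by embedding similarity and combine that ranking with BM25F. The sketch below illustrates the general idea only. The toy vectors, the unigram weighting a / (a + p(w)), and the linear mixture are assumptions made for illustration, not the repository's code or tuned parameters.

```python
import math

# Hypothetical 2-dimensional embeddings; real vectors come from training.
WORD_VECS = {"ancient": [1.0, 0.0], "wonders": [0.0, 1.0]}
ENTITY_VECS = {
    "Colossus of Rhodes": [0.6, 0.8],
    "7 Wonders (board game)": [-0.2, 0.1],
}
# Hypothetical unigram probabilities of the query words.
UNIGRAM_PROBS = {"ancient": 0.001, "wonders": 0.0005}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embed_query(words, a=1e-3):
    """Weighted sum of word vectors; rarer words get larger weights (assumed scheme)."""
    dim = len(next(iter(WORD_VECS.values())))
    query = [0.0] * dim
    for w in words:
        if w not in WORD_VECS:
            continue  # skip out-of-vocabulary words
        weight = a / (a + UNIGRAM_PROBS.get(w, 0.0))
        for i, x in enumerate(WORD_VECS[w]):
            query[i] += weight * x
    return query


def rank(words):
    """Rank entities by cosine similarity to the query embedding, best first."""
    q = embed_query(words)
    scores = {e: cosine(q, v) for e, v in ENTITY_VECS.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def interpolate(bm25f_score, embedding_score, lam):
    """Linear mixture of two scores; `lam` is the weight of the embedding score."""
    return (1.0 - lam) * bm25f_score + lam * embedding_score


ranking = rank("wonders of the ancient world".split())
print(ranking[0][0])
```

BM25F scores and cosine similarities live on different scales, so a real interpolation would likely need score normalization and a mixture weight tuned on the query folds.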
@InProceedings{Nikolaev:2020:KEWER,
author="Nikolaev, Fedor and Kotov, Alexander",
title="Joint Word and Entity Embeddings for Entity Retrieval from a Knowledge Graph",
booktitle="Advances in Information Retrieval",
year="2020",
publisher="Springer International Publishing",
address="Cham",
pages="141--155",
isbn="978-3-030-45439-5"
}
If you have any questions or suggestions, send an email to [email protected] or create a GitHub issue.