All datasets are located on /datasets
a volume exclusively for datasets like IR test collections, document corpora or other forms of data that is used in our research.
Dataset | Creator | Year | Size | Type | Usecase |
---|---|---|---|---|---|
AOL | G. Pass, A. Chowdhury, C. Torgeson | 2006 | 2,1G (zipped) | IR test collection | personalization, query reformulation or other types of search research |
semanticscholar | Waleed Ammar | 2019 | 46G (zipped) | document corpora | ad-hoc retrieval |
iSearch | Aalborg University | 2010 | 50G (zipped) | IR test collection | Integreated search and citation-based retrieval |
Washington Post | NIST | 2018 | 1.5G (zipped) | IR test collection | ad-hoc retrieval |
Washington Post (v4) | NIST | 2021 | 2.4G (zipped) | IR test collection | ad-hoc retrieval |
Tipster 1/2/3 | NIST | 1994 | 1.3G (zipped) | IR test collection | ad-hoc retrieval |
TREC Disks 4/5 | NIST | 1997 | 820MB (zipped) | document corpora | ad-hoc retrieval |
New York Times | Evan Sandhaus | 2008 | 1G (zipped) | document corpora | ad-hoc retrieval |
AQUAINT | David Graff | 2002 | 3G (zipped) | document corpora | ad-hoc retrieval |
GIRT4 | GESIS-IZ | 2006 | 110M (zipped) | IT test collection | ad-hoc retrieval, domain-specific, multilingual |
TripClick | Navid Rekab-saz, Oleg Lesota, Markus Schedl, Jon Brassey, Carsten Eickhoff | 2021 | 32.7G (zipped) | Click log dataset | ad-hoc retrieval, deep learning models |
Yahoo-L18 | Yahoo! Research | 2009/10 | 1.3G (zipped) | Click log dataset | ad-hoc retrieval, session analysis |
Yandex - Personalized Web Search Challenge | Eugene Kharitonov, Pavel Serdyukov | 2014 | 5.9G (zipped) | Click log dataset | ad-hoc retrieval, session analysis |
TREC-OpenSearch | TREC OpenSearch Organizers | 2016/17 | 600M (zipped) | Click log dataset | ad-hoc retrieval, session analysis |
- Login on
linux2
. - Create a new folder for the dataset and copy the
README.template.md
in the new folder. Rename the file toREADME.md
- Describe the data set along the template.
- Copy all files for the dataset to the folder and add all binary files and folder to
.gitignore
. - Commit the
README.md
and all the additional files you would like to see on GitHub. - Update this page to include a brief description of the dataset.