Deep Learning Hard (DL-HARD) is a new annotated dataset extending TREC Deep Learning benchmark.

ahsmourad/DL-Hard

 
 



DL-Hard

Annotated Deep Learning Dataset For Passage and Document Retrieval

Table of Contents
  1. Paper
  2. Overview
  3. Dataset
  4. Change Log
  5. Hard Queries
  6. New Judgements
  7. Annotations
  8. Entity Links
  9. Evaluation
  10. Baselines
  11. Future Work

Paper

This work is published as a resource paper in SIGIR 2021. Link: link

@inproceedings{mackie2021dlhard,
 title={How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset},
 author={Mackie, Iain and Dalton, Jeffrey and Yates, Andrew},
 booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 year={2021}
}

Overview

Colab demo (Pyserini): link
Colab demo (PyTerrier): link

Deep Learning Hard (DL-HARD) is a new annotated dataset building upon standard deep learning benchmark evaluation datasets. It builds on TREC Deep Learning (DL) questions extensively annotated with query intent categories, answer types, wikified entities, topic categories, and result type metadata from a leading web search engine. Based on this data, we introduce a framework for identifying challenging questions. DL-HARD contains 50 queries from the official 2019/2020 evaluation benchmark, half of which are newly and independently assessed. We perform experiments using the official submitted runs to DL on DL-HARD and find substantial differences in metrics and the ranking of participating systems. Overall, DL-HARD is a new resource that promotes research on neural ranking methods by focusing on challenging and complex queries.

DL-Hard Diagram

Note, NIST judged/unjudged DL query counts in the diagram are approximated for simplicity (see track overview paper for specifics). Due to the differences in TREC DL task querysets, DL-HARD provides 25 new document judgments and 27 new passage judgments. Both DL-HARD tasks have 50 queries overall.

Dataset

DL-Hard provides 50 queries for passage and document retrieval:

The corpora used are the MS MARCO passage and document corpora.

Available via ir_datasets: link
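As a rough sketch of loading the data through ir_datasets, the helper below assumes the dataset identifiers are `msmarco-passage/trec-dl-hard` and `msmarco-document/trec-dl-hard` (check the ir_datasets catalogue if loading fails). The `load_dl_hard` and `summarize` names are illustrative and not part of this repository.

```python
from collections import Counter


def load_dl_hard(task="passage"):
    """Load DL-HARD through ir_datasets (requires `pip install ir_datasets`).

    The identifier pattern is an assumption; `task` is "passage" or "document".
    """
    import ir_datasets  # imported lazily so summarize() works without it

    return ir_datasets.load(f"msmarco-{task}/trec-dl-hard")


def summarize(queries, qrels):
    """Return the query count and the number of judgments per relevance grade."""
    n_queries = sum(1 for _ in queries)
    grades = Counter(qrel.relevance for qrel in qrels)
    return n_queries, dict(sorted(grades.items()))


# Usage (downloads data on first call):
#   dataset = load_dl_hard("passage")
#   print(summarize(dataset.queries_iter(), dataset.qrels_iter()))
```

`summarize` only relies on each judgment having a `relevance` attribute, so it works with any qrels iterator that yields such records.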

Colab demo (Pyserini): link
Colab demo (PyTerrier): link

Change Log

Major dataset changes that existing users should be aware of:

  • 4th May 2021: Topic 273695 added for passages and documents (qrels, baselines, topics.tsv, folds, etc.). We wanted a round 50-topic test set rather than the original 49.
  • 20th May 2021: Topics 1056416 and 1103153 added to passage qrels (see issue #1)

Hard Queries

To differentiate system performance between large neural ranking models, new challenging and complex benchmark queries are required. Hard queries were identified within the DL 2019/20 test sets through:

  1. Automatic Hard Criteria: Because manually reviewing all candidate queries is time consuming, we explore the use of annotated metadata only, without requiring knowledge of system effectiveness. We use Google’s web search answer type as a base, with additional List and Reason query intents added to improve recall. Intent types matching Quantity, Weather, and Language (mostly dictionary lookups) are excluded.
  2. Manual Hard Criteria: Each candidate question generated by the Automatic Hard Criteria is manually labeled by multiple authors, and candidate hard queries are discussed by all authors. Guidelines include: non-factoid, beyond single passage, answerable, text-focused, mostly well-formed, and possibly complex.

For example:

  • Easy query from TREC DL 2020: what is reba mcentire's net worth. BM25 achieves Recall@100 = 1.0 and neural re-rankers achieve NDCG@10 > 0.9.
  • Hard query from DL-Hard: symptoms of different types of brain bleeds. BM25 achieves Recall@100 < 0.7 and neural re-rankers achieve NDCG@10 < 0.25.

See paper for more details: link

We measure official TREC 2020 document run submissions on the overlapping DL-HARD subset and compare them to the original DL Track. On an average relative basis for above-median systems, DL-HARD NDCG@10 is 21.1% lower, RR is 23.2% lower, and Recall@100 is 19.6% lower. The ranking also changes: there is a new top system (‘ICIP_run1’), and each system moves 4.6 places on average. This large number of swaps suggests that removing the easier queries allows for a better comparison between state-of-the-art retrieval systems.
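The two kinds of comparison described here (average relative metric change and average rank movement) can be sketched as follows. This is a minimal illustration, not the script used for the paper: the function names and the toy scores are invented, and any restriction to above-median systems would be applied to the inputs beforehand.

```python
def mean_relative_change(original, hard):
    """Average of (hard - original) / original over systems, as a percentage."""
    changes = [(hard[s] - original[s]) / original[s] for s in original]
    return 100.0 * sum(changes) / len(changes)


def rank_positions(scores):
    """Map system name -> rank (1 = best) by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {system: rank for rank, system in enumerate(ordered, start=1)}


def mean_rank_shift(original, hard):
    """Average absolute change in rank between the two leaderboards."""
    before, after = rank_positions(original), rank_positions(hard)
    return sum(abs(before[s] - after[s]) for s in before) / len(before)


# Toy NDCG@10 scores for three made-up systems (not real TREC runs).
dl_scores = {"run_a": 0.60, "run_b": 0.50, "run_c": 0.40}
hard_scores = {"run_a": 0.30, "run_b": 0.45, "run_c": 0.36}

print(round(mean_relative_change(dl_scores, hard_scores), 1))
print(round(mean_rank_shift(dl_scores, hard_scores), 2))
```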

Top 20 systems TREC 2020 document run submissions (DL-Hard vs. DL TREC):

Annotation Diagram

New Judgements

The resource uses the full provided NIST assessments for the 25 previously judged queries. There are also new passage and document judgments provided for the 25 unjudged queries from TREC DL:

  • Passage Judgements: link
  • Document Judgements (Mapping Passage-Level Judgments): link
  • Document Judgements (Document-Level Judgments): link

Experienced IR researchers perform the annotations following the DL guidelines. We find the Krippendorff’s alpha is 0.47 on the passage judgements, which indicates moderate agreement. Krippendorff’s alpha drops to 0.12 when considering the agreement on document judgments, illustrating the difficulty of automatically transferring passage judgements to documents. For this reason, we re-produced document judgments annotating at a document level, which achieved Krippendorff’s alpha of 0.430.
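For readers unfamiliar with the agreement statistic, here is a small self-contained sketch of Krippendorff's alpha built from the coincidence matrix. It assumes the interval distance (squared difference between relevance grades); the text above does not state which distance metric was used, so the values it produces are not guaranteed to match the reported ones.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha(units, distance=lambda a, b: (a - b) ** 2):
    """Krippendorff's alpha for a list of units, each a list of ratings.

    Units with fewer than two ratings are ignored. The default distance is
    the interval metric (squared difference), which is an assumption here.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # Each ordered pair of ratings in a unit contributes 1 / (m - 1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w * distance(a, b) for (a, b), w in coincidences.items()) / n
    expected = sum(
        totals[a] * totals[b] * distance(a, b) for a in totals for b in totals
    ) / (n * (n - 1))
    if expected == 0:
        return 1.0  # every rating is identical, so there is no disagreement
    return 1.0 - observed / expected


# Two annotators, four items, binary labels (toy data).
print(krippendorff_alpha([[0, 0], [1, 1], [0, 1], [1, 0]]))
```

Passing a different `distance` function (for example `lambda a, b: float(a != b)` for nominal labels) changes the metric without touching the rest of the computation.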

On further analysis, most disagreements with a difference greater than 1 relevance grade appear to stem from different query interpretations, e.g., accepting any definition of "geon" versus looking for a specific definition (similar to "define visceral"). To remove this ambiguity, further work will look to add query descriptions.

Note, the official DL-Hard document qrels (link) use document-level judgements and passage qrels (link) use passage-level judgements.

Annotations

Annotations are provided for 400 queries from the DL 2019/20 test datasets (link). The annotations tsv has the following columns:

  • 0: Topic id
  • 1: Query
  • 2: Query Intent: Recently developed question intent taxonomy for web questions [Cambazoglu et al., 2021].
  • 3: Answer Type: Manual annotation of target answer type for web questions.
  • 4: Topic Domain: Breakdown of questions by topic domain.
  • 5: SERP Result Type: Answer type provided by the Search Engine Results Page (SERP). HTML of queries found: here.
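The column layout above can be read with a few lines of standard-library Python. This is a sketch under two assumptions: the file has no header row, and every row has at least the six columns listed. The `QueryAnnotation` class and `read_annotations` function are illustrative names, and the sample row contains placeholder labels rather than real annotations.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class QueryAnnotation:
    topic_id: str
    query: str
    query_intent: str
    answer_type: str
    topic_domain: str
    serp_result_type: str


def read_annotations(handle):
    """Parse tab-separated annotation rows into QueryAnnotation records."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [QueryAnnotation(*row[:6]) for row in reader if len(row) >= 6]


# Made-up example row; the label values are placeholders.
sample = "123\texample query text\tIntentLabel\tAnswerLabel\tDomainLabel\tSerpLabel\n"
records = read_annotations(io.StringIO(sample))
print(records[0].query_intent)
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` object.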

Diagram of annotations:

Annotation Diagram

See paper for full details: link

Entity Links

Golden entity links and high-recall results for SOTA entity linkers (REL [Van Hulst et al., 2020], BLINK [Wu et al., 2020], GENRE [De Cao et al., 2020], ELQ [Li et al., 2020]) are provided for all 400 queries from the DL 2019/20 test datasets.

Golden entity links to Wikipedia (2021/02/27) can be found: here. Also included are annotations: (1) Answer in Link: whether question is answered within linked Wikipedia page, and (2) Core Entities in Wiki: whether any core entities of the question were not found in Wikipedia.

SOTA entity linkers results can be found: here

Evaluation

Official metrics for DL-Hard are NDCG@10 and RR. For binary metrics, labels of two or greater should be considered as relevant. Thus, trec_eval command: "trec_eval -l 2 -o -c -M1000 -q -m all_trec".
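To make the metric definitions concrete, the sketch below computes NDCG@10 and RR for a single query in plain Python. It assumes linear gains with a log2 discount for NDCG and applies the "two or greater" threshold only to RR, mirroring the `-l 2` flag. It is an illustration, not a replacement for trec_eval, and the toy judgments are invented.

```python
import math


def ndcg_at_k(ranking, qrels, k=10):
    """NDCG@k with graded gains and a log2(rank + 1) discount."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [max(qrels.get(doc, 0), 0) for doc in ranking[:k]]
    ideal = sorted((g for g in qrels.values() if g > 0), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if ideal else 0.0


def reciprocal_rank(ranking, qrels, min_rel=2):
    """1 / rank of the first document whose label is at least min_rel."""
    for rank, doc in enumerate(ranking, start=1):
        if qrels.get(doc, 0) >= min_rel:
            return 1.0 / rank
    return 0.0


# Toy judgments and ranking for one query (invented for illustration).
qrels = {"d1": 3, "d2": 1, "d3": 2, "d4": 0}
ranking = ["d2", "d4", "d3", "d1"]

print(round(ndcg_at_k(ranking, qrels), 3))
print(round(reciprocal_rank(ranking, qrels), 3))
```

Note that "d2" (label 1) contributes gain to NDCG but does not count as relevant for RR, which is the practical effect of the threshold.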

Baselines

Baseline runs, tuned parameters and trec_evals can be found in the baselines directory: link. These runs utilize the standard 5-folds for cross-validation (3x train folds, 1x validation fold, 1x test fold) and the outlined trec_eval procedure.
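One way to picture the 3/1/1 rotation is the sketch below. The choice of which fold serves as validation for a given test fold is an assumption made for illustration; the fold files in the repository define the actual assignment.

```python
def fold_splits(num_folds=5):
    """Yield (train, validation, test) fold indices for each rotation."""
    for test in range(num_folds):
        # Assumption: the fold after the test fold is used for validation.
        validation = (test + 1) % num_folds
        train = [f for f in range(num_folds) if f not in (test, validation)]
        yield train, validation, test


for train, validation, test in fold_splits():
    print(f"train={train} validation={validation} test={test}")
```

Each fold is used as the test fold exactly once, so concatenating the five test-fold rankings gives one run covering all 50 queries.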

Colab demo (Pyserini): link
Colab demo (PyTerrier): link

Document Baselines:

Initial retrieval BM25 and BM25+RM3 runs use Pyserini.

BERT-MaxP(Zero-Shot) and T5-MaxP(Zero-Shot) re-rankers use pygaggle standard models (i.e. BERT-large and T5-base) that are fine-tuned on MS MARCO (but not fine-tuned on DL-HARD folds). The documents are sharded into 5-sentence chunks with no overlap, and the max passage score is taken to represent the document (Nogueira et al., 2020).
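The MaxP aggregation can be sketched in a few lines. The sentence splitter here is a naive regular expression and `term_overlap` is a stand-in scorer invented for this example; in the baselines the passage score comes from the neural re-ranker, and the sentence segmentation may differ.

```python
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(sentences, size=5):
    """Group sentences into consecutive, non-overlapping chunks."""
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]


def max_passage_score(query, document, score_passage, size=5):
    """Score each chunk and return the maximum as the document score."""
    chunks = chunk_sentences(split_sentences(document), size)
    return max((score_passage(query, chunk) for chunk in chunks), default=0.0)


def term_overlap(query, passage):
    """Stand-in scorer: fraction of query terms that occur in the passage."""
    terms = query.lower().split()
    words = set(re.findall(r"\w+", passage.lower()))
    return sum(t in words for t in terms) / len(terms)


document = "A. B. C. D. E. Brain bleeds have several symptoms."
print(max_passage_score("brain bleeds", document, term_overlap))
```

Any callable taking `(query, passage)` and returning a number can be passed as `score_passage`, which is where a re-ranker's scoring function would plug in.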

BERT-MaxP and Electra-MaxP are first fine-tuned on MS MARCO, before further fine-tuning on the provided DL-HARD training folds. Performance on the validation fold is used to select the optimal epoch for each corresponding test fold. See Dai and Callan (2019) for implementation details. PARADE-BERT and PARADE-Electra are trained with a similar procedure for DL-HARD. See Li et al. (2020) for implementation details.

| System | NDCG@10 | RR | MAP | Recall@1000 | Run |
| --- | --- | --- | --- | --- | --- |
| BM25 | 0.272 | 0.368 | 0.174 | 0.775 | link |
| BM25+RM3 | 0.279 | 0.365 | 0.174 | 0.775 | link |
| BM25+BERT-MaxP(Zero-Shot) | 0.310 | 0.405 | 0.187 | 0.775 | link |
| BM25+BERT-MaxP | 0.317 | 0.402 | 0.200 | 0.775 | link |
| BM25+RM3+BERT-MaxP(Zero-Shot) | 0.314 | 0.415 | 0.188 | 0.775 | link |
| BM25+RM3+BERT-MaxP | 0.295 | 0.443 | 0.181 | 0.775 | link |
| BM25+T5-MaxP(Zero-Shot) | 0.327 | 0.367 | 0.184 | 0.775 | link |
| BM25+RM3+T5-MaxP(Zero-Shot) | 0.307 | 0.359 | 0.170 | 0.775 | link |
| BM25+Electra-MaxP | 0.385 | 0.448 | 0.216 | 0.775 | link |
| BM25+RM3+Electra-MaxP | 0.380 | 0.461 | 0.215 | 0.775 | link |
| BM25+PARADE-BERT | 0.299 | 0.413 | 0.174 | 0.775 | link |
| BM25+RM3+PARADE-BERT | 0.313 | 0.419 | 0.187 | 0.775 | link |
| BM25+PARADE-Electra | 0.356 | 0.498 | 0.207 | 0.775 | link |
| BM25+RM3+PARADE-Electra | 0.357 | 0.489 | 0.211 | 0.775 | link |

Passage Baselines:

Initial retrieval BM25 and BM25+RM3 runs use Pyserini.

BERT(Zero-Shot) and T5(Zero-Shot) re-rankers use pygaggle standard models (i.e. BERT-large and T5-base) that are fine-tuned on MS MARCO (but not further fine-tuned on DL-HARD folds).

| System | NDCG@10 | RR | MAP | Recall@1000 | Run |
| --- | --- | --- | --- | --- | --- |
| BM25 | 0.304 | 0.504 | 0.173 | 0.669 | link |
| BM25+RM3 | 0.273 | 0.409 | 0.175 | 0.703 | link |
| BM25+BERT(Zero-Shot) | 0.399 | 0.558 | 0.229 | 0.669 | link |
| BM25+RM3+BERT(Zero-Shot) | 0.395 | 0.559 | 0.234 | 0.703 | link |
| BM25+T5(Zero-Shot) | 0.408 | 0.591 | 0.238 | 0.669 | link |
| BM25+RM3+T5(Zero-Shot) | 0.396 | 0.577 | 0.238 | 0.703 | link |

Future Work

Please suggest any future extensions or bug fixes on github or email ([email protected]).

Current planned work:

  • Queries not judged by NIST only have judgments for the top 10 documents from the QnA dataset (i.e. judgments are sparse), so we plan to create deeper judgments.
  • Add TREC-style 'descriptions' to queries to disambiguate answers.
  • Dense retrieval baselines (ColBERT, DPR, etc.).
  • Create MS Marco duplicate passage list.
  • Add a complementary entity ranking task.
