Annotations for WikiSection / MedQuAD / HealthQA

This folder contains additional annotations for three datasets: WikiSection, MedQuAD and HealthQA. As we cannot provide the original texts, you need to download each dataset into its folder and run convert.sh to create the document files.

If you use these annotations in your work, please cite:

@inproceedings{arnold2020learning,
  author = {Arnold, Sebastian and {van Aken}, Betty and Grundmann, Paul and Gers, Felix A. and L{\"o}ser, Alexander},
  title = {Learning {{Contextualized Document Representations}} for {{Healthcare Answer Retrieval}}},
  booktitle = {Proceedings of The Web Conference 2020 (WWW '20)},
  year = {2020},
  doi = {10.1145/3366423.3380208}
}

Format

These files are created by convert.sh and contain tab-separated values with the following fields:

DATASET_test_docs.tsv

Document/passage full-text with one sentence per line.

field	description
`doc_id`	A corpus-wide document ID. Sentences with the same `doc_id` belong to the same document.
`p_id`	A corpus-wide passage ID. Sentences with the same `p_id` belong to the same passage.
`t`	Sequential sentence index in the range `0` to `T-1`
`text`	Plain text of the sentence, non-tokenized.
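
As a minimal sketch of how these rows could be reassembled into passages, the snippet below groups sentences by `p_id` and joins them in order of `t`. It assumes the columns appear in the order listed above and that the file has no header row (neither is confirmed here); `read_passages` and the sample rows are hypothetical and only illustrate the layout.

```python
import csv
import io
from collections import defaultdict

# Assumed column order, taken from the field table above. Whether the real
# files carry a header row is not stated, so none is expected here.
DOC_FIELDS = ["doc_id", "p_id", "t", "text"]


def read_passages(handle):
    """Group sentences by p_id and join them in order of t."""
    reader = csv.DictReader(handle, fieldnames=DOC_FIELDS, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    sentences = defaultdict(list)
    for row in reader:
        sentences[row["p_id"]].append((int(row["t"]), row["text"]))
    # Sort each passage by sentence index before joining the text.
    return {p_id: " ".join(text for _, text in sorted(items))
            for p_id, items in sentences.items()}


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "d1\tp1\t0\tFirst sentence.\n"
    "d1\tp1\t1\tSecond sentence.\n"
    "d1\tp2\t2\tThird sentence.\n"
)
passages = read_passages(sample)
```

For a real file, the `io.StringIO` object would be replaced by an open file handle for the corresponding `DATASET_test_docs.tsv`.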

DATASET_test_queries.tsv

One query-answer candidate pair per line, referring to the passages above.

field	description
`query_id`	A corpus-wide query ID.
`relevance`	`1` if the answer is relevant to the query, `0` if it is not relevant. Every query has at most 64 candidate answers.
`doc_id`	Reference to the document ID containing the candidate answer.
`p_id`	Reference to the passage ID that is the candidate answer.
`question`	The question in natural language.
`entity_id`	Wikidata ID of the entity that the question focuses on, e.g. "Q2140130".
`entity_name`	Canonical name of the entity that the question focuses on, e.g. "Lateral medullary syndrome".
`aspect_label`	Normalized short label for the question aspect, e.g. "symptom".
`aspect_heading`	Slightly longer description of the question aspect from UMLS, e.g. "signs and symptoms".
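
Since each line holds one query-candidate pair, a common first step is to collect all candidates of a query together. The sketch below does this with the standard library, assuming the column order listed above and no header row; `read_candidates`, the IDs and the question text are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Assumed column order, taken from the field table above (no header row).
QUERY_FIELDS = ["query_id", "relevance", "doc_id", "p_id", "question",
                "entity_id", "entity_name", "aspect_label", "aspect_heading"]


def read_candidates(handle):
    """Map each query_id to its list of (p_id, is_relevant) candidates."""
    reader = csv.DictReader(handle, fieldnames=QUERY_FIELDS, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    candidates = defaultdict(list)
    for row in reader:
        candidates[row["query_id"]].append(
            (row["p_id"], row["relevance"] == "1"))
    return dict(candidates)


# Made-up sample rows; only the entity and aspect values come from the
# examples in the field table.
sample = io.StringIO(
    "q1\t1\td1\tp1\tWhat are the symptoms?\tQ2140130\t"
    "Lateral medullary syndrome\tsymptom\tsigns and symptoms\n"
    "q1\t0\td2\tp2\tWhat are the symptoms?\tQ2140130\t"
    "Lateral medullary syndrome\tsymptom\tsigns and symptoms\n"
)
candidates = read_candidates(sample)
```

The `p_id` values can then be looked up in the passages built from the docs file.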

DATASET_test_matchzoo.tsv

Passage full-text with one query-answer candidate pair per line for supervised training and evaluation in MatchZoo.

field	description
`relevance`	`1` if the answer is relevant to the query, `0` if it is not relevant. Every query has at most 64 candidate answers; in the training data, every query has exactly 10.
`query`	The query in the form `entity ; aspect`.
`text`	Plain text of the passage, tokenized.
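
The `query` field can be split back into its two parts if needed. This is a small sketch that assumes the separator is a single semicolon and that entity names contain no semicolon themselves; `split_query` is a hypothetical helper, not part of convert.sh or MatchZoo.

```python
def split_query(query):
    """Split an 'entity ; aspect' string into (entity, aspect)."""
    # partition() splits at the first semicolon; if there is none,
    # the whole string is returned as the entity and the aspect is empty.
    entity, _, aspect = query.partition(";")
    return entity.strip(), aspect.strip()


entity, aspect = split_query("Lateral medullary syndrome ; symptom")
```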

DATASET_train_labels.tsv

Entity/aspect training labels with one sequential sentence position per line.

field	description
`doc_id`	Reference to the document ID.
`p_num`	Sequential passage index per document.
`t_start`	Sequential sentence index. The current label is valid for sentence `t_start` and all following sentences until the next label.
`entity_ids`	Semicolon-separated list of the entities that the sentences focus on.
`entity_names`	Semicolon-separated list of canonical names of the entities that the sentences focus on.
`aspect_labels`	Semicolon-separated list of normalized short aspect labels describing the sentences.
`aspect_headings`	Semicolon-separated list of section headings describing the sentences.
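
Because a label stays valid from `t_start` until the next label, per-sentence labels have to be derived by looking up the last label that starts at or before a given sentence. The sketch below shows one way to do that for a single document, assuming `t_start` is a document-level sentence index and the labels are sorted by it; `label_for_sentence` and all label values except "symptom" are made up for illustration.

```python
from bisect import bisect_right


def label_for_sentence(starts, labels, t):
    """Return the label active at sentence t, or None before the first label.

    starts must be sorted ascending; labels[i] applies from starts[i]
    up to (but not including) starts[i + 1].
    """
    i = bisect_right(starts, t) - 1
    return labels[i] if i >= 0 else None


# Hypothetical labels for one document with 7 sentences.
starts = [0, 3, 5]
labels = ["symptom", "treatment", "diagnosis"]
per_sentence = [label_for_sentence(starts, labels, t) for t in range(7)]
```

The semicolon-separated fields of each row can be split with `str.split(";")` before or after this lookup.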

License

The licenses of the individual datasets apply. All additional annotations contained in this folder are released under the Creative Commons Attribution-ShareAlike 3.0 Unported License. You should have received a copy of the license along with this work. If not, see http://creativecommons.org/licenses/by-sa/3.0/.
