Instructions for Dataset Construction

Datasets (train/dev/test)

The training, development and test sets are json files, named in the format $TASKNAME_$STAGE.json. For example: dbpedia_train.json, dbpedia_dev.json and dbpedia_test.json.

Each line of a json file holds one sample, stored as a Python dict.
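As a minimal sketch of this layout, the snippet below writes two samples to a file, one per line, and reads them back. It assumes every line is valid JSON that `json.loads` can parse. The helper names `write_samples` and `read_samples`, the task name `toy` and the sample contents are illustrative and not part of this repository.

```python
import json
import os
import tempfile


def write_samples(path, samples):
    """Write one JSON-serialized dict per line."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")


def read_samples(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical task name "toy"; the file name follows $TASKNAME_$STAGE.json.
samples = [
    {"sentence": "good movie", "label": 1, "constituency_tree_encoding": "0,1,2"},
    {"sentence": "bad plot", "label": 2, "constituency_tree_encoding": "0,1,2"},
]
path = os.path.join(tempfile.mkdtemp(), "toy_train.json")
write_samples(path, samples)
loaded = read_samples(path)
```

Reading line by line keeps memory use flat for large datasets, since no single top-level JSON array has to be parsed at once.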

Keys of a Training Sample

Sentence Classification

  • sentence (str): tokenized sentence. Tokens are separated by spaces.
  • label (int): label for classification. Labels must start from 1 for our code to work.
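  • constituency_tree_encoding (str): encoding of the corresponding binary constituency tree, in the format "left1,right1,father1,left2,right2,father2,...". Each triple gives the indices of a left child, a right child and their parent. For a sentence of n tokens, leaf nodes are numbered 0 to n-1 and internal nodes are numbered from n upward. For example, the encoding of the tree (I ((love (my (pet cat))) .)) is "3,4,6,2,6,7,1,7,8,8,5,9,0,9,10".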


Specifically, the following function takes a binary constituency parse tree in parenthesis format, e.g. (I ((love (my (pet cat ))).)), and outputs the desired encoding. The input is not sensitive to whitespace: the first line of the function pads every parenthesis with spaces.

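def get_tree_encodings(binary_parse):
    # Pad parentheses with spaces so that split() separates them from words.
    binary_parse = binary_parse.replace('(', ' ( ').replace(')', ' ) ')
    sentence = binary_parse.replace('(', ' ').replace(')', ' ')
    words = sentence.split()
    components = binary_parse.split()
    final_answers = list()
    stack = list()
    curr_index = 0  # index of the next leaf (0 ~ n-1)
    non_leaf_index = len(words)  # index of the next internal node (n, n+1, ...)
    for w in components:
        if w == '(':  # guard
            stack.append(w)
        elif w != ')':  # shift
            stack.append(curr_index)
            curr_index += 1
        else:  # reduce
            index_left = stack[-2]
            index_right = stack[-1]
            final_answers.extend([index_left, index_right, non_leaf_index])
            # Pop the guard and both children, then push their father.
            stack = stack[:len(stack)-3]
            stack.append(non_leaf_index)
            non_leaf_index += 1
    # A binary tree with n leaves has 2n-1 nodes, so the root index is 2n-2.
    assert len(stack) == 1
    assert stack[0] == 2 * curr_index - 2
    assert curr_index == len(words)
    final_answers = [str(x) for x in final_answers]
    return ','.join(final_answers)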
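
To sanity-check an encoding, it can help to turn it back into a parenthesized tree. The sketch below is not part of this repository: `decode_tree_encoding` is a hypothetical helper, and it assumes the triples list every child before its father (bottom-up), so that the last triple describes the root.

```python
def decode_tree_encoding(sentence, encoding):
    """Rebuild a parenthesized binary tree from a sentence and its encoding."""
    words = sentence.split()
    # Leaves are numbered 0..n-1, so node i < n is simply the i-th word.
    subtrees = {i: w for i, w in enumerate(words)}
    ids = [int(x) for x in encoding.split(",") if x.strip()]
    if len(ids) % 3 != 0:
        raise ValueError("encoding length must be a multiple of 3")
    root = None
    for k in range(0, len(ids), 3):
        left, right, father = ids[k:k + 3]
        # Assumption: both children were defined by earlier triples or are leaves.
        subtrees[father] = "(" + subtrees[left] + " " + subtrees[right] + ")"
        root = father
    # With no triples (a one-word sentence), return the words unchanged.
    return subtrees[root] if root is not None else " ".join(words)


tree = decode_tree_encoding(
    "I love my pet cat .", "3,4,6,2,6,7,1,7,8,8,5,9,0,9,10"
)
print(tree)  # (I ((love (my (pet cat))) .))
```

A KeyError raised while decoding indicates a triple that refers to a node not yet defined, which usually means the encoding and the tokenized sentence do not match.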

Sentence Relation Classification

  • sentence_1 (str), sentence_2 (str): tokenized sentences.
  • sentence_1_binary_encoding (str), sentence_2_binary_encoding (str): encodings of the corresponding constituency trees, in the same format as constituency_tree_encoding above.
  • gold_label (int): label for classification.

We apologize for the inconsistent names in the two types of classification tasks.

Sentence Generation

  • source_sentence (str): tokenized source sentence.
  • target_sentence (str): tokenized target sentence.
  • parsed_source_sentence (str): encoding of the constituency tree of the source sentence.
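
Because the key names differ across the three task types, a small check before training can catch mislabeled fields early. The following is a minimal sketch, not code from this repository: the task identifiers in `REQUIRED_KEYS` and the function `find_problems` are hypothetical names, while the keys and types are the ones listed above.

```python
# Keys and types per task, as listed in this document.
# The task identifiers (dict keys) are hypothetical names for illustration.
REQUIRED_KEYS = {
    "sentence_classification": {
        "sentence": str,
        "label": int,
        "constituency_tree_encoding": str,
    },
    "sentence_relation_classification": {
        "sentence_1": str,
        "sentence_2": str,
        "sentence_1_binary_encoding": str,
        "sentence_2_binary_encoding": str,
        "gold_label": int,
    },
    "sentence_generation": {
        "source_sentence": str,
        "target_sentence": str,
        "parsed_source_sentence": str,
    },
}


def find_problems(sample, task):
    """Return a list of problems; an empty list means the sample looks fine."""
    problems = []
    for key, expected_type in REQUIRED_KEYS[task].items():
        if key not in sample:
            problems.append("missing key: " + key)
        elif not isinstance(sample[key], expected_type):
            problems.append("wrong type for key: " + key)
    return problems


good = {"sentence": "good movie", "label": 1, "constituency_tree_encoding": "0,1,2"}
bad = {"sentence": "good movie", "gold_label": 1}
print(find_problems(good, "sentence_classification"))
print(find_problems(bad, "sentence_classification"))
```

The second call reports the missing label and constituency_tree_encoding keys, which is the typical symptom of reusing a relation-classification sample in a sentence-classification file.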


A vocabulary is a plain text file containing a list of words, one word per line.
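
A vocabulary file in this format can be written and read with a few lines of Python. This is only a sketch: `write_vocab` and `load_vocab` are hypothetical helpers, and mapping each word to its 0-based line number is an assumption about how the file is consumed, not something this document specifies.

```python
import os
import tempfile


def write_vocab(path, words):
    """Write one word per line."""
    with open(path, "w", encoding="utf-8") as f:
        for word in words:
            f.write(word + "\n")


def load_vocab(path):
    """Map each word to its 0-based position, skipping empty lines (assumption)."""
    with open(path, encoding="utf-8") as f:
        words = [line.strip() for line in f]
    words = [word for word in words if word]
    return {word: index for index, word in enumerate(words)}


# Hypothetical file name for illustration.
vocab_path = os.path.join(tempfile.mkdtemp(), "toy.vocab")
write_vocab(vocab_path, ["the", "cat", "sat"])
vocab = load_vocab(vocab_path)
```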


This folder contains part of dbpedia_dev.json, conj_dev.json and mt_dev.json, as well as snli.vocab for reference.