Datasets for training, development, and testing are stored in JSON files, which should be named following the format `$TASKNAME_$STAGE.json`, e.g., `dbpedia_train.json`, `dbpedia_dev.json`, and `dbpedia_test.json`.
In a JSON file, each line denotes one sample, stored as a Python dict. For single-sentence classification tasks, a sample has the following fields:
- `sentence` (str): tokenized sentence. Tokens are separated by spaces.
- `label` (int): label for classification. Note that it should start from 1 for our code to work well.
- `constituency_tree_encoding` (str): encoding of the corresponding constituency tree, in the format "left1,right1,father1,left2,right2,father2,...". Leaf nodes are numbered 0 to n-1, and internal nodes are numbered from n upward in the order they are built. For example, the encoding of the tree for `(I ((love (my (pet cat ))).))` should be "3,4,6,2,6,7,1,7,8,8,5,9,0,9,10".
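For illustration, one line of such a file could look like the following (the label value here is hypothetical; the sentence and encoding come from the example above):

```json
{"sentence": "I love my pet cat .", "label": 1, "constituency_tree_encoding": "3,4,6,2,6,7,1,7,8,8,5,9,0,9,10"}
```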
To be specific, the following function takes a binary constituency parse tree in parenthesis format (e.g., `(I ((love (my (pet cat ))).))`; it is not sensitive to whitespace, which the first line takes care of) and outputs the desired encoding.
```python
def get_tree_encodings(binary_parse):
    # Pad parentheses with spaces so that the tree splits cleanly into tokens.
    binary_parse = binary_parse.replace('(', ' ( ').replace(')', ' ) ')
    sentence = binary_parse.replace('(', ' ').replace(')', ' ')
    words = sentence.split()
    components = binary_parse.split()
    final_answers = list()
    stack = list()
    curr_index = 0                # next leaf index; leaves are numbered 0..n-1
    non_leaf_index = len(words)   # internal nodes are numbered from n upward
    for w in components:
        if w == '(':    # guard
            stack.append(w)
        elif w != ')':  # shift: push the next leaf index
            stack.append(curr_index)
            curr_index += 1
        else:           # reduce: combine the two topmost children
            index_left = stack[-2]
            index_right = stack[-1]
            final_answers.append(index_left)
            final_answers.append(index_right)
            final_answers.append(non_leaf_index)
            stack = stack[:len(stack) - 3]  # pop both children and the matching '('
            stack.append(non_leaf_index)
            non_leaf_index += 1
    assert len(stack) == 1                 # only the root is left
    assert stack[0] == 2 * curr_index - 2  # a binary tree with n leaves has 2n-1 nodes
    assert curr_index == len(words)
    final_answers = [str(x) for x in final_answers]
    return ','.join(final_answers)
```
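As a sanity check, calling the function on the example parse above reproduces the encoding given earlier:

```python
print(get_tree_encodings('(I ((love (my (pet cat ))).))'))
# 3,4,6,2,6,7,1,7,8,8,5,9,0,9,10
```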
For sentence-pair classification tasks, a sample has the following fields:

- `sentence_1` (str), `sentence_2` (str): tokenized sentences.
- `sentence_1_binary_encoding` (str), `sentence_2_binary_encoding` (str): encodings of the corresponding parse trees.
- `gold_label` (int): label for classification.
We apologize for the inconsistent field names between the two types of classification tasks.
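By analogy with the single-sentence format, one line of a sentence-pair file could look like this (the sentences and label below are made up for illustration; the encodings were produced with `get_tree_encodings` from the parses `((a man) (is (walking .)))` and `((a person) (walks .))`):

```json
{"sentence_1": "a man is walking .", "sentence_2": "a person walks .", "sentence_1_binary_encoding": "0,1,5,3,4,6,2,6,7,5,7,8", "sentence_2_binary_encoding": "0,1,4,2,3,5,4,5,6", "gold_label": 1}
```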
For machine translation tasks, a sample has the following fields:

- `source_sentence` (str): tokenized source sentence.
- `target_sentence` (str): tokenized target sentence.
- `parsed_source_sentence` (str): encoding of the corresponding constituency tree of the source sentence.
For vocabularies, just drop a list of words into a text file, one word per line.
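A minimal sketch of how these files could be read (assuming each line of a dataset file is valid JSON, and using Python's standard `json` module):

```python
import json

# Each line of a dataset file is one sample, parsed into a dict.
with open('dbpedia_dev.json') as f:
    samples = [json.loads(line) for line in f]

# A vocabulary file has one word per line.
with open('snli.vocab') as f:
    vocab = [line.strip() for line in f]
```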
This folder contains part of `dbpedia_dev.json`, `conj_dev.json` and `mt_dev.json`, as well as `snli.vocab`, for reference.