Constructing RuleTaker variants for fact verification

Scripts to build RuleTaker-CWA and RuleTaker-Skip-fact from the original RuleTaker dataset (Clark et al., 2020).

First, download and unzip the RuleTaker dataset:

wget http://data.allenai.org/rule-reasoning/rule-reasoning-dataset-V2020.2.4.zip
unzip rule-reasoning-dataset-V2020.2.4.zip
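
The preparation commands in this README read `${split}.jsonl` and `meta-${split}.jsonl` from the `depth-3ext-NatLang` folder for the train, dev, and test splits. The following is a minimal sketch, not part of the released scripts, for checking that those six files exist after unzipping. The function name `missing_inputs` is hypothetical, and the sketch assumes it is run from the directory where the archive was unzipped.

```python
from pathlib import Path

# Directory and file layout as used by the commands in this README.
DATASET_DIR = Path("rule-reasoning-dataset-V2020.2.4") / "depth-3ext-NatLang"
SPLITS = ("train", "dev", "test")


def missing_inputs(dataset_dir=DATASET_DIR, splits=SPLITS):
    """Return the expected input files that are not present on disk.

    Hypothetical helper: it only checks for file existence and does not
    validate the contents of the files.
    """
    expected = []
    for split in splits:
        expected.append(Path(dataset_dir) / f"{split}.jsonl")
        expected.append(Path(dataset_dir) / f"meta-{split}.jsonl")
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        for path in missing:
            print(f"missing: {path}")
    else:
        print("all input files found")
```

An empty result means every file that the preparation loops expect is in place.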

Prepare RuleTaker-CWA

To prepare the RuleTaker-CWA dataset, run:

mkdir ruletaker-cwa
for split in train dev test; do
    python prepare_RuleTaker_CWA.py \
        rule-reasoning-dataset-V2020.2.4/depth-3ext-NatLang/${split}.jsonl \
        rule-reasoning-dataset-V2020.2.4/depth-3ext-NatLang/meta-${split}.jsonl \
        ruletaker-cwa/${split}.jsonl
done

Prepare RuleTaker-Skip-fact

Note: due to the inherent randomness in the algorithm, you might get slightly different skip-fact variants in each run. To reproduce the numbers in the original paper, please use the released RuleTaker-Skip-fact dataset (link).

To prepare a RuleTaker-Skip-fact dataset, run:

mkdir ruletaker-skipfact
for split in train dev test; do
    python prepare_RuleTaker_Skipfact.py \
        rule-reasoning-dataset-V2020.2.4/depth-3ext-NatLang/${split}.jsonl \
        rule-reasoning-dataset-V2020.2.4/depth-3ext-NatLang/meta-${split}.jsonl \
        ruletaker-skipfact/${split}.jsonl
done
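
Because the skip-fact construction is randomized, two runs may produce different files. The following is a minimal sketch, not part of the released scripts, for checking which splits differ between two output directories by comparing SHA-256 digests. The helper names `sha256_of` and `differing_splits` are hypothetical, as is the second directory name `ruletaker-skipfact-run2` in the usage note. A differing digest only indicates that the files are not byte-identical; it says nothing about how much they differ.

```python
import hashlib
import sys
from pathlib import Path

SPLITS = ("train", "dev", "test")


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_splits(dir_a, dir_b, splits=SPLITS):
    """Return the splits whose <split>.jsonl files are not byte-identical."""
    return [
        split
        for split in splits
        if sha256_of(Path(dir_a) / f"{split}.jsonl")
        != sha256_of(Path(dir_b) / f"{split}.jsonl")
    ]


if __name__ == "__main__":
    # Expects two directories, each holding train/dev/test .jsonl files.
    if len(sys.argv) == 3:
        print(differing_splits(sys.argv[1], sys.argv[2]))
```

For example, saving the sketch as `compare_runs.py` (a hypothetical file name) and running `python compare_runs.py ruletaker-skipfact ruletaker-skipfact-run2` prints the list of splits that changed between the two runs.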

Creating an entity-anonymized FEVER dataset

The script for this is create_data_anonymized.py. Run it as follows:

python create_data_anonymized.py ./bert_dev.json ./anon_dev.json
python create_data_anonymized.py ./bert_train.json ./anon_train.json

The first argument is the input file and the second is the output file. Typically, the input file is the one produced by the sentence-retrieval step of the FEVER task. The ./bert_dev.json and ./bert_train.json files can be downloaded from the original KGAT drive folder.