This directory contains the outputs of the IterX model for the main experiments described in the paper. It is organized as follows:
└── <dataset>                          # Name of the dataset: one of {muc4, scirex}
    └── <model_name>                   # Name of the model: here, {iterx}
        └── <encoder_name>             # Name of the encoder
            ├── raw                    # Raw outputs of the model
            │   └── preds.test.jsonlines    # IterX outputs in JSON Lines format; each line is a JSON object (see the loading sketch below)
            └── comparisons            # Outputs of our prediction comparison tool, showing comparisons between predictions and references and the corresponding scores
                ├── test.rme.phi3.txt       # Under the "CEAF-RME_{phi3}" scorer
                └── test.rme.subset.txt     # Under the "CEAF-RME_{subset}" scorer
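The following is a minimal sketch of loading the raw predictions, assuming only that preds.test.jsonlines follows the standard JSON Lines convention; the exact fields inside each object are defined by the IterX output format and are not shown here. The path is illustrative.

```python
import json

# Illustrative path; substitute the actual <dataset>/<model_name>/<encoder_name> directory.
preds_path = "muc4/iterx/<encoder_name>/raw/preds.test.jsonlines"

# One JSON object per non-empty line.
with open(preds_path, "r", encoding="utf-8") as f:
    predictions = [json.loads(line) for line in f if line.strip()]

print(len(predictions), "prediction objects loaded")
# Inspect the schema of the first object (field names depend on IterX).
print(sorted(predictions[0].keys()))
```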
The comparison files are generated by our prediction comparison tool and illustrate how scores are computed
under a specific metric for a particular document.
In these files, results are grouped by document; each document starts with a stats line showing the number of
predictions and references, together with the scores under the corresponding metric.
For example, the following is a snippet of test.rme.phi3.txt:
doc_id=TST3-MUC4-0003 #pred=2 #gold=1 prec=0.3750 rec=0.7500 f1=0.5000
This means that for the document TST3-MUC4-0003, there are 2 predictions and 1 reference, and that the precision,
recall, and F1 scores under the CEAF-RME_{phi3} metric are 0.3750, 0.7500, and 0.5000, respectively.
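If you need these scores programmatically, the sketch below parses the per-document stats lines. The regular expression and field names are assumptions based on the example line above, not an official specification of the comparison file format.

```python
import re

# Matches stats lines of the form shown above; the pattern is inferred from the example.
STATS_RE = re.compile(
    r"doc_id=(?P<doc_id>\S+)\s+#pred=(?P<n_pred>\d+)\s+#gold=(?P<n_gold>\d+)\s+"
    r"prec=(?P<prec>[\d.]+)\s+rec=(?P<rec>[\d.]+)\s+f1=(?P<f1>[\d.]+)"
)

def parse_stats_lines(path):
    """Yield one dict per per-document stats line in a comparison file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            m = STATS_RE.search(line)
            if m:
                d = m.groupdict()
                yield {
                    "doc_id": d["doc_id"],
                    "n_pred": int(d["n_pred"]),
                    "n_gold": int(d["n_gold"]),
                    "prec": float(d["prec"]),
                    "rec": float(d["rec"]),
                    "f1": float(d["f1"]),
                }

# Example usage: collect per-document F1 under CEAF-RME_{phi3}.
# scores = {s["doc_id"]: s["f1"] for s in parse_stats_lines("comparisons/test.rme.phi3.txt")}
```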
Below this stats line, you will see the templates aligned under the metric together with their fillers, the templates
predicted but not found in the references (marked "Predicted but not matched"), and the templates not predicted by
the model (marked "Not predicted").
Note that slot alignments may not be printed accurately due to limitations of the printing script, but this does not
affect the final scoring.