This is an internal repository containing the workdirs with project-specific scripts that are called from Kubeflow pipelines.
Three repositories are required to test-run dataset preparation locally: wikidata-workdir, qald, and genie-toolkit.
Install them separately, using the master branch for wikidata-workdir, the wip/entity-recovery-mode branch for the qald repository, and the wip/qald branch for genie-toolkit. Then, inside the wikidata-workdir directory, create a configuration file config.mk with the following lines:
geniedir=<PATH_TO_YOUR_GENIE_INSTALLATION>
qalddir=<PATH_TO_YOUR_QALD_INSTALLATION>
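As a minimal sketch, the configuration file can also be written from the shell. The checkout locations below ($HOME/genie-toolkit and $HOME/qald) and the GENIE_DIR / QALD_DIR variable names are assumptions made for illustration; substitute the paths of your own installations.

```shell
#!/bin/sh
# Hypothetical checkout locations; override via the environment or edit here.
GENIE_DIR="${GENIE_DIR:-$HOME/genie-toolkit}"
QALD_DIR="${QALD_DIR:-$HOME/qald}"

# Write the two settings described above into config.mk in the current
# directory (run this from inside the wikidata-workdir checkout).
cat > config.mk <<EOF
geniedir=${GENIE_DIR}
qalddir=${QALD_DIR}
EOF

# Show the result for a quick visual check.
cat config.mk
```

Running it again simply overwrites config.mk with the current values.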
To generate a sample dataset, run the following command:
make datadir
This will generate a small sample dataset with oracle NED. To use the ReFinED entity linker instead, add the following options to the command:
entity_recovery_mode=true
refined_model=models/refined
ned=refined
synthetic_ned=refined
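Put together, the sample command with the ReFinED options might look like the sketch below. It only prints the assembled command by default; the RUN switch is a hypothetical convenience of this sketch, not something the repository defines.

```shell
#!/bin/sh
# The four options listed above for switching from oracle NED to ReFinED.
REFINED_OPTS="entity_recovery_mode=true refined_model=models/refined ned=refined synthetic_ned=refined"

# Same target as the sample-dataset command, with the options added.
CMD="make $REFINED_OPTS datadir"

# Print the command by default; set RUN=1 (hypothetical switch) to execute it.
if [ "${RUN:-0}" = "1" ]; then
  $CMD
else
  echo "$CMD"
fi
```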
To evaluate an existing model:
- Install genienlp.
- Download the model using the following command, where <path> is the folder containing the model under the Azure bucket pvc-a8853620-9ac7-4885-a30e-0ec357f17bb6. The model will be downloaded under models/<model_name>.
./sync-models.sh <path> <model_name>
- Run the following command to evaluate, where <eval_set> is eval for the dev set and test for the test set.
make \
refined_model=models/refined \
entity_recovery_mode=true \
ned=refined \
metric=answer \
eval_set=<eval_set> \
<eval_set>/<model_name>.results
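To evaluate on both sets in one go, the command above can be wrapped in a loop. This is a sketch under stated assumptions: my_model is a placeholder model name, and the RUN switch is a hypothetical addition so that the script prints the commands unless explicitly asked to execute them.

```shell
#!/bin/sh
# Placeholder model name; replace with the <model_name> passed to sync-models.sh.
MODEL_NAME="${MODEL_NAME:-my_model}"

# eval is the dev set, test is the test set (see the description above).
for EVAL_SET in eval test; do
  CMD="make refined_model=models/refined entity_recovery_mode=true ned=refined metric=answer eval_set=$EVAL_SET $EVAL_SET/$MODEL_NAME.results"

  # Print by default; set RUN=1 (hypothetical switch) to actually run make.
  if [ "${RUN:-0}" = "1" ]; then
    $CMD || exit 1
  else
    echo "$CMD"
  fi
done
```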
Note that generating the manifest.tt file takes a very long time. Once it has been generated and no update is needed, add the option update_manifest=false to all make commands above to save time.
If a command fails midway or the dataset is updated, run make safe-clean to clean up the folder before rerunning the command.