A toolkit for evaluating natural language generation (NLG) with the BLEU, ROUGE, METEOR, and CIDEr metrics.
Make sure the following dependencies are installed on your machine:
Python 2.7+
numpy
If you want to use the METEOR metric, make sure a Java Runtime Environment (JRE) is installed and configured on your machine.
You can run evaluation via:
python run_eval.py --hypos output_file --refs reference_file
where output_file is the file containing the outputs produced by your system, and reference_file is the file containing the references. Note that the text must be tokenized before evaluation.
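No particular tokenizer is prescribed here. As a minimal sketch, assuming one segment per line and that a plain regex split of words and punctuation is good enough for your data, pre-tokenization might look like the following. The names `simple_tokenize` and `tokenize_file` are hypothetical and not part of this toolkit.

```python
import io
import re

# Hypothetical helper, not part of the toolkit: a token is either a run of
# word characters or a single non-space, non-word character (punctuation).
_TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)


def simple_tokenize(line):
    """Return the line with its tokens separated by single spaces."""
    return " ".join(_TOKEN_RE.findall(line))


def tokenize_file(src_path, dst_path):
    """Tokenize src_path line by line into dst_path.

    Assumption: each line holds one segment (hypothesis or reference).
    """
    with io.open(src_path, encoding="utf-8") as src, \
            io.open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(simple_tokenize(line) + u"\n")
```

This rule also splits contractions such as "don't" into three tokens, so for published comparisons it is safer to apply the same tokenizer as the work you compare against, and to apply it identically to the hypothesis and reference files.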
Evaluation with multiple references is supported. The command is:
python run_eval.py --hypos output_file --refs ref_1 ref_2 ... ref_n
where ref_1 ref_2 ... ref_n are the n reference files.
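Before a multi-reference run, it can help to check that the files line up. The sketch below assumes, without confirmation from this document, that the files are line-aligned, so that line i of every reference file corresponds to line i of the hypothesis file. The helper names are hypothetical.

```python
import io


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with io.open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_aligned(hypo_path, ref_paths):
    """Raise ValueError if a reference file differs in length from the hypotheses.

    Assumption: line i of each file refers to the same example.
    Returns the common line count when all files agree.
    """
    expected = count_lines(hypo_path)
    for path in ref_paths:
        found = count_lines(path)
        if found != expected:
            raise ValueError(
                "%s has %d lines, expected %d" % (path, found, expected))
    return expected
```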
You can choose which metrics to compute. By default, all metrics (BLEU, ROUGE, METEOR, and CIDEr) are enabled. To disable a metric you do not need (e.g., BLEU), use:
python run_eval.py --hypos output_file --refs reference_file [-nB | --no_BLEU]
Similarly, you can turn off other metrics:
[-nR | --no_ROUGE] # for ROUGE
[-nM | --no_METEOR] # for METEOR
[-nC | --no_CIDEr] # for CIDEr
You can also change the maximum n-gram order n to obtain BLEU-n scores:
python run_eval.py --hypos output_file --refs reference_file [-n | --ngram] 3
You will then get the BLEU-n scores for n <= 3 (BLEU-1, BLEU-2, and BLEU-3). The default value of n is 4.
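To illustrate what the n-gram order controls, here is a minimal sketch of the clipped (modified) n-gram precision that standard BLEU is built on, computed for each order from 1 up to a maximum. It is not this toolkit's implementation: it works on a single sentence, omits the brevity penalty and the geometric mean over orders, and the function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precisions(hypo, refs, max_n=4):
    """Return the clipped n-gram precision of hypo for n = 1..max_n.

    hypo is a list of tokens; refs is a list of token lists.
    """
    precisions = []
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypo, n)
        # For each n-gram, keep the largest count seen in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # A hypothesis n-gram is credited at most as often as it occurs
        # in the reference where it is most frequent.
        matched = sum(min(count, max_ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        precisions.append(float(matched) / total if total else 0.0)
    return precisions
```

With `max_n=3` the function returns three precisions, one per order, which mirrors how a smaller n-gram setting yields fewer BLEU-n scores.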
To run the evaluation in lowercase mode, use:
python run_eval.py --hypos output_file --refs reference_file [-lc | --lowercase]
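When the evaluation is launched from another script, the options above can be assembled programmatically. The following is a rough sketch that uses only the long-form flags listed in this document. `build_command` and `run_eval` are hypothetical wrappers, and the sketch assumes `run_eval.py` is in the current working directory and is run with the same interpreter as the wrapper.

```python
import subprocess
import sys

# Long-form flags as documented above, keyed by metric name.
_DISABLE_FLAGS = {
    "BLEU": "--no_BLEU",
    "ROUGE": "--no_ROUGE",
    "METEOR": "--no_METEOR",
    "CIDEr": "--no_CIDEr",
}


def build_command(hypos, refs, disable=(), ngram=None, lowercase=False,
                  script="run_eval.py"):
    """Assemble the argument list for one evaluation run.

    hypos is the hypothesis file, refs a list of one or more reference files.
    """
    cmd = [sys.executable, script, "--hypos", hypos, "--refs"] + list(refs)
    for metric in disable:
        cmd.append(_DISABLE_FLAGS[metric])  # KeyError on an unknown metric
    if ngram is not None:
        cmd += ["--ngram", str(ngram)]
    if lowercase:
        cmd.append("--lowercase")
    return cmd


def run_eval(*args, **kwargs):
    """Run the evaluation and return the process exit code."""
    return subprocess.call(build_command(*args, **kwargs))
```

For example, `run_eval("output_file", ["ref_1", "ref_2"], disable=["METEOR"], lowercase=True)` would skip METEOR (and therefore the Java dependency) and evaluate in lowercase mode against two reference files.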