This folder contains the evaluation harness for evaluating agents on the GAIA benchmark.
Please follow the instructions here to set up your local development environment and LLM.
We are using the GAIA dataset hosted on Hugging Face.
Please accept the dataset's terms and make sure you are logged in on your machine via `huggingface-cli login`
before running the evaluation.
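For example, you can authenticate from a terminal as follows (this assumes the `huggingface_hub` CLI is installed in your environment):

```bash
# Log in interactively; the CLI will prompt for a Hugging Face access token
huggingface-cli login

# Or, on recent huggingface_hub versions, pass your token non-interactively
huggingface-cli login --token "$HF_TOKEN"
```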
The following is the basic command to start the evaluation. Here we are evaluating on the validation set for the `2023_all`
split. You can adjust `./evaluation/gaia/scripts/run_infer.sh`
to change the subset you want to evaluate on.
```bash
./evaluation/gaia/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [gaia_subset]
# e.g., ./evaluation/gaia/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 CodeActAgent 300
```
where `model_config` is mandatory, while `git-version`, `agent`, `eval_limit` and `gaia_subset` are optional.
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`, defaulting to `gpt-3.5-turbo`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like `0.6.2`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances, defaulting to all instances.
- `gaia_subset`, the GAIA benchmark has multiple subsets: `2023_level1`, `2023_level2`, `2023_level3`, `2023_all`, defaulting to `2023_level1`.
For example,

```bash
./evaluation/gaia/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 CodeActAgent 10
```
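If you also want to select a specific subset, pass it as the last argument. A hypothetical invocation (the values are illustrative) could look like:

```bash
# Illustrative values: evaluate the first 10 instances of the 2023_level2 subset
./evaluation/gaia/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 CodeActAgent 10 2023_level2
```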
Then you can get stats by running the following command:
```bash
python ./evaluation/gaia/get_score.py \
    --file <path_to/output.json>
```
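For instance, assuming your run wrote an `output.json` to your evaluation outputs directory (the path below is purely illustrative, not the harness's actual layout):

```bash
# Hypothetical path; point --file at the output.json written by your run
python ./evaluation/gaia/get_score.py \
    --file evaluation_outputs/gaia/CodeActAgent/output.json
```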