Berkeley Function Calling Leaderboard (BFCL)
We introduce the Berkeley Function Calling Leaderboard (BFCL), the first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' (LLMs) ability to invoke functions. Unlike previous evaluations, BFCL accounts for various forms of function calls, diverse scenarios, and executability.
💡 Read more in our blog posts:
🦍 See the live leaderboard at Berkeley Function Calling Leaderboard
# Create a new Conda environment with Python 3.10
conda create -n BFCL python=3.10
conda activate BFCL
# Clone the Gorilla repository
git clone https://github.com/ShishirPatil/gorilla.git
# Change directory to the `berkeley-function-call-leaderboard`
cd gorilla/berkeley-function-call-leaderboard
# Install the package in editable mode
pip install -e .
For locally hosted models, choose one of the following backends, ensuring you have the right GPU and OS setup:
`sglang` is much faster than `vllm`, but it only supports newer GPUs with SM 80+ (Ampere and later). If you are using an older GPU (T4/V100), use `vllm` instead, as it supports a much wider range of GPUs.
Using `vllm`:
pip install -e .[oss_eval_vllm]
Using `sglang`:
pip install -e .[oss_eval_sglang]
Optional: If using `sglang`, we recommend installing `flashinfer` for speedups. Find instructions here.
We store environment variables in a `.env` file. An example `.env.example` file is provided in the `gorilla/berkeley-function-call-leaderboard` directory. Make a copy of this file and fill in the necessary values.
cp .env.example .env
# Fill in necessary values in `.env`
If you are running any proprietary models, make sure the model API keys are included in your `.env` file. Models such as GPT, Claude, Mistral, Gemini, and Nova require them.
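As a rough sketch, the relevant `.env` entries might look like the following; the variable names here are illustrative assumptions, so use the exact names listed in `.env.example`:
# Hypothetical variable names -- confirm against `.env.example`
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...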
If you want to run executable test categories, you must provide API keys. Add the keys to your `.env` file so that the placeholder values used in questions/params/answers can be replaced with real data.
There are 4 API keys to include:

- RAPID-API Key: https://rapidapi.com/hub
  - Yahoo Finance: https://rapidapi.com/sparior/api/yahoo-finance15
  - Real Time Amazon Data: https://rapidapi.com/letscrape-6bRBa3QguO5/api/real-time-amazon-data
  - Urban Dictionary: https://rapidapi.com/community/api/urban-dictionary
  - Covid 19: https://rapidapi.com/api-sports/api/covid-193
  - Time zone by Location: https://rapidapi.com/BertoldVdb/api/timezone-by-location

  All the Rapid APIs we use have a free tier. You need to subscribe to these API providers to set up the executable test environment, but it is free of charge.

- Exchange Rate API: https://www.exchangerate-api.com
- OMDB API: http://www.omdbapi.com/apikey.aspx
- Geocode API: https://geocode.maps.co/
The evaluation script will automatically search for dataset files in the default `./data/` directory and replace the placeholder values with the actual API keys you provided in the `.env` file.
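As an illustrative sketch only, the executable-test entries in `.env` might look like this; the variable names below are assumptions, so follow the names given in `.env.example`:
# Hypothetical variable names -- confirm against `.env.example`
RAPID_API_KEY=...
EXCHANGERATE_API_KEY=...
OMDB_API_KEY=...
GEOCODE_API_KEY=...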
- `MODEL_NAME`: For available models, please refer to SUPPORTED_MODELS.md. If not specified, the default model `gorilla-openfunctions-v2` is used.
- `TEST_CATEGORY`: For available test categories, please refer to TEST_CATEGORIES.md. If not specified, all categories are included by default.
You can provide multiple models or test categories by separating them with commas. For example:
bfcl generate --model claude-3-5-sonnet-20241022-FC,gpt-4o-2024-11-20-FC --test-category parallel,multiple,exec_simple
- All generated model responses are stored in the `./result/` folder, organized by model and test category: `result/MODEL_NAME/BFCL_v3_TEST_CATEGORY_result.json`
- To use a custom directory for the result files, specify it with `--result-dir`; the path should be relative to the `berkeley-function-call-leaderboard` root folder, as shown below.
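For instance, a run that writes results to a custom directory might look like this (the directory name is just an illustration):
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --result-dir ./my_results/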
An inference log is included with the model responses to help analyze and debug the model's performance and better understand its behavior. For more verbose logging, use the `--include-input-log` flag. Refer to LOG_GUIDE.md for details on how to interpret the inference logs.
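For example, to enable the more verbose logging described above (model and category names are placeholders):
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --include-input-log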
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --num-threads 1
- Use `--num-threads` to control the level of parallel inference. The default (`1`) means no parallelization.
- The maximum allowable number of threads depends on your API's rate limits.
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --backend {vllm|sglang} --num-gpus 1 --gpu-memory-utilization 0.9
- Choose your backend using `--backend vllm` or `--backend sglang`. The default backend is `vllm`.
- Control GPU usage by adjusting `--num-gpus` (default `1`, relevant for multi-GPU tensor parallelism) and `--gpu-memory-utilization` (default `0.9`), which can help avoid out-of-memory errors.
If you have a server already running (e.g., vLLM in a SLURM cluster), you can bypass the vLLM/sglang setup phase and directly generate responses by using the `--skip-server-setup` flag:
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --skip-server-setup
In addition, you should specify the endpoint and port used by the server. By default, the endpoint is `localhost` and the port is `1053`. These can be overridden by the `VLLM_ENDPOINT` and `VLLM_PORT` environment variables in the `.env` file:
VLLM_ENDPOINT=localhost
VLLM_PORT=1053
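For example, to point the harness at a server running on another machine (the hostname and port below are placeholders, not project defaults):
VLLM_ENDPOINT=my-inference-node.example.com
VLLM_PORT=8000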
For those who prefer using script execution instead of the CLI, you can run the following command:
# Make sure you are inside the `berkeley-function-call-leaderboard` directory
python openfunctions_evaluation.py --model MODEL_NAME --test-category TEST_CATEGORY
When specifying multiple models or test categories, separate them with spaces, not commas. All other flags mentioned earlier are compatible with the script execution method as well.
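For example, to run two models over two categories with the script (the model names are placeholders; note the space-separated lists):
python openfunctions_evaluation.py --model MODEL_NAME_1 MODEL_NAME_2 --test-category parallel multiple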
Important: You must have generated the model responses before running the evaluation.
Once you have the results, run:
bfcl evaluate --model MODEL_NAME --test-category TEST_CATEGORY
The `MODEL_NAME` and `TEST_CATEGORY` options are the same as those used in the Generating LLM Responses section. For details, refer to SUPPORTED_MODELS.md and TEST_CATEGORIES.md.
If in the previous step you stored the model responses in a custom directory, you should specify it using the `--result-dir` flag; the path should be relative to the `berkeley-function-call-leaderboard` root folder.
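For instance, continuing the earlier example of a custom results directory (the path is illustrative):
bfcl evaluate --model MODEL_NAME --test-category TEST_CATEGORY --result-dir ./my_results/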
Note: Unevaluated test categories will be marked as `N/A` in the evaluation result CSV files. For summary columns (e.g., `Overall Acc`, `Non_Live Overall Acc`, `Live Overall Acc`, and `Multi Turn Overall Acc`), the reported score treats all unevaluated categories as 0 during calculation.
For executable categories, if the API keys are not provided, the evaluation process will skip those categories and treat them as if they were not evaluated.
If any of your test categories involve executable tests (e.g., the category name contains `exec` or `rest`), you can set the `--api-sanity-check` flag (or `-c` for short) to have the evaluation process perform a sanity check on all REST API endpoints involved. If any of them are not behaving as expected, you will be alerted in the console; the evaluation process will continue regardless.
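For example (the category names here follow the convention described above):
bfcl evaluate --model MODEL_NAME --test-category exec_simple,rest --api-sanity-check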
Evaluation scores are stored in `./score/`, mirroring the structure of `./result/`: `score/MODEL_NAME/BFCL_v3_TEST_CATEGORY_score.json`

- To use a custom directory for the score files, specify it with `--score-dir`; the path should be relative to the `berkeley-function-call-leaderboard` root folder, as shown below.
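For example (the directory name is illustrative):
bfcl evaluate --model MODEL_NAME --test-category TEST_CATEGORY --score-dir ./my_scores/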
Additionally, four CSV files are generated in `./score/`:

- `data_overall.csv` – Overall scores for each model. This is used for updating the leaderboard.
- `data_live.csv` – Detailed breakdown of scores for each Live (single-turn) test category.
- `data_non_live.csv` – Detailed breakdown of scores for each Non-Live (single-turn) test category.
- `data_multi_turn.csv` – Detailed breakdown of scores for each Multi-Turn test category.
If you'd like to log evaluation results to WandB artifacts:
pip install -e .[wandb]
Make sure you also set `WANDB_BFCL_PROJECT=ENTITY:PROJECT` in `.env`.
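For example (the entity and project names are placeholders):
WANDB_BFCL_PROJECT=my-team:bfcl-eval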
For those who prefer using script execution instead of the CLI, you can run the following command:
# Make sure you are inside the `berkeley-function-call-leaderboard/bfcl/eval_checker` directory
cd bfcl/eval_checker
python eval_runner.py --model MODEL_NAME --test-category TEST_CATEGORY
When specifying multiple models or test categories, separate them with spaces, not commas. All other flags mentioned earlier are compatible with the script execution method as well.
We welcome contributions! To add a new model:
- Review `bfcl/model_handler/base_handler.py` and/or `bfcl/model_handler/local_inference/base_oss_handler.py` (if your model is hosted locally).
- Implement a new handler class for your model.
- Update `bfcl/model_handler/handler_map.py` and `bfcl/eval_checker/model_metadata.py`.
- Submit a Pull Request.
For detailed steps, please see the Contributing Guide.
- Gorilla Discord (`#leaderboard` channel)
- Project Website
All the leaderboard statistics and data used to train the models are released under Apache 2.0. Gorilla is an open-source effort from UC Berkeley, and we welcome contributors. Please email us your comments, criticisms, and questions. More information about the project can be found at https://gorilla.cs.berkeley.edu/