Berkeley Function Calling Leaderboard Updates (v1.2)

@ShishirPatil released this 05 Jan 04:39
Commit 79b1c60

Highlights

🏆 Berkeley Function Calling Leaderboard V3 with Multi-step and Multi-turn function call evaluation

What's Changed

  • [BFCL] Package the Codebase by @devanshamin in #565
  • Added python script named as raft_local.py to raft directory to run script completely locally using HF models by @himanshushukla12 in #605
  • RAFT Enhancements: Improved robustness, logging, checkpointing, threading, Llama support, Azure auth and eval by @cedricvidal in #604
  • Fix/merge commit #605 and #604 by @ShishirPatil in #609
  • Fix issue #614: [BFCL] ModuleNotFoundError after commit 70d6722 by @kobe0938 in #615
  • Fix some bugs in test case prompts/ground truths by @aw632 in #608
  • [BFCL] Dataset and Possible Answer Fix by @HuanzhiMao in #600
  • Add Salesforce xLAM model series by @zuxin666 in #616
  • Update gemini_handler.py to better handle NL+FC model output by @vandyxiaowei in #617
  • [BFCL] Fix Decoding Issue in Nvidia Handler by @HuanzhiMao in #623
  • [BFCL] Fix Llama Handler by @HuanzhiMao in #626
  • [BFCL] add MadeAgents/Hammer-7b handler by @linqq9 in #627
  • [BFCL] Refactor Model Handler into OSS and Proprietary Components by @devanshamin in #612
  • [BFCL] Hot Fix to Remove Extra Parameters for NoAPIKeyError by @HuanzhiMao in #636
  • fix: bug for glm prompt format by @zhangch-ss in #638
  • [BFCL] Add New Model o1-preview-2024-09-12 and o1-mini-2024-09-12 by @HuanzhiMao in #635
  • [BFCL] BFCL v3 by @HuanzhiMao in #644
  • removed unnecessary comments in raft/raft_local.py by @himanshushukla12 in #654
  • [BFCL] Chore: Separate Change Log. by @HuanzhiMao in #648
  • [BFCL] Bug Fix inference_single_turn_FC function for base_handler by @HuanzhiMao in #656
  • [BFCL] Bug Fix parse_nested_value function for model_handler utils by @VishnuSuresh27 in #660
  • added Phi-3 handlers by @AndyChenYH in #640
  • Update agent arena frontend and evals by @NithikYekollu in #666
  • [BFCL] Speed Up Locally-hosted Model Inference Process by @HuanzhiMao in #671
  • [BFCL] Fix Hanging Inference for OSS Models on GPU Platforms by @HuanzhiMao in #663
  • [BFCL] Add gemini-1.5-pro-002, gemini-1.5-pro-002-FC, gemini-1.5-pro-001, gemini-1.5-pro-001-FC, gemini-1.5-flash-002, gemini-1.5-flash-002-FC, gemini-1.0-pro-002, gemini-1.0-pro-002-FC by @HuanzhiMao in #658
  • [BFCL] Add Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct by @HuanzhiMao in #657
  • [BFCL] Add ToolACE handler for BFCL-v3 by @XuHwang in #653
  • Add Qwen handler and fix mean_latency calculation error for OSS models by @zhangch-ss in #642
  • update README.md by @leosun12 in #669
  • [BFCL] Chore: Various Improvements and Adjustments by @HuanzhiMao in #673
  • [BFCL] Chore: Refactor File Path Handling and Automate apply_function_credential_config.py by @HuanzhiMao in #675
  • docs: update README.md by @eltociear in #676
  • [BFCL-v3] Multi-Turn Possible Answer Order Change by @Fanjia-Yan in #679
  • update hammer handler and add Hammer2.0 model by @linqq9 in #667
  • [BFCL] Chore: Improve Multi Turn Error Logs by @HuanzhiMao in #689
  • Update google-cloud-aiplatform dependency by @jieru-hu in #677
  • add minicpm3 4b by @Cppowboy in #633
  • [BFCL-v2] Dataset and Possible Answer Fix by @HuanzhiMao in #661
  • [BFCL] Add Gemma-2 models by @jacovkim in #696
  • add a basic bfcl command-line interface by @mattf in #621
  • Fixing BFCL-v3 multi-turn apps by @virginie-do in #701
  • [BFCL v1] Update Executable Ground Truth for REST Category by @CharlieJCJ in #708
  • [BFCL v1] Rephrase Question for Better Clarity for Java & JavaScript Categories by @HuanzhiMao in #709
  • [BFCL] Add SGLang Backend Support for OSS Local Inference by @hnyls2002 in #587
  • (typo):I've made some corrections to your repository to improve clarity by @PrathameshSPawar in #713
  • docs: Centered the Image by @bhargavshirin in #680
  • [BFCL] Multi Turn Dataset and Possible Answer Fix by @HuanzhiMao in #683
  • [BFCL] Chore: Separate out Func Doc for Multi-Turn Categories by @HuanzhiMao in #717
  • [BFCL] Multi Turn Dataset and Possible Answer Fix (Base Category) by @HuanzhiMao in #719
  • [BFCL] Multi Turn Dataset Fix (Function Doc) by @HuanzhiMao in #722
  • [BFCL] Multi Turn Dataset Fix (Base Category) by @HuanzhiMao in #723
  • [BFCL] Multi Turn Pipeline Robustness Patch by @HuanzhiMao in #724
  • [BFCL] Small typo in variable name in travel_booking.py by @daanaea in #731
  • [BFCL] Patch #724 by @HuanzhiMao in #730
  • [BFCL] Multi Turn Dataset Fix (Miss Func & Long Context) by @HuanzhiMao in #728
  • [BFCL] Multi Turn Dataset Fix (Miss Param) by @HuanzhiMao in #732
  • [BFCL] Update Eval Metric for Multi Turn Irrelevance Scenarios by @HuanzhiMao in #725
  • [BFCL] Remove duplicate in eval_runner.py by @ThomasRochefortB in #735
  • [BFCL] Support Dynamic max_tokens for Locally-Hosted Models by @HuanzhiMao in #712
  • [BFCL] Refine Evaluation Metric for Multi Turn Categories by @HuanzhiMao in #733
  • [BFCL] Adding New Model GoGoAgent by @RogueTensor in #720
  • [BFCL] Chore: Improve Inference Log Readability by @HuanzhiMao in #746
  • [BFCL Dataset Revamp 1/n] Multi-Turn (Part 1) by @Fanjia-Yan in #740
  • [BFCL] Robustness Patch for _multi_threaded_inference by @HuanzhiMao in #754
  • [BFCL] Prompt Caching for Claude Models by @VishnuSuresh27 in #751
  • [BFCL Dataset Revamp 2/n] Live Dataset Fix (Simple, Parallel, Parallel Multiple) by @Fanjia-Yan in #737
  • [BFCL Dataset Revamp 3/n] Live Dataset Fix (Multiple) by @Fanjia-Yan in #739
  • Update google-cloud-aiplatform version to 1.72.0 by @gabrielibagon in #760
  • [BFCL] Minor Grammatical Corrections to DEFAULT_SYSTEM_PROMPT by @HuanzhiMao in #747
  • [BFCL] Remove Llama-3.2-3B-Instruct-FC and Llama-3.2-1B-Instruct-FC from Leaderboard by @HuanzhiMao in #749
  • [BFCL Chore] Supply data_multi_turn.csv for Multi-Turn Evaluation Results by @HuanzhiMao in #762
  • [BFCL] Remove Workaround Patch for Vertex AI Package by @HuanzhiMao in #761
  • Add exponential retry logic for gemini models by @gabrielibagon in #764
  • [BFCL] Remove Duplicate Line in record_cost_latency by @HuanzhiMao in #767
  • Fix handling of examples with no tools in Gemini by @gabrielibagon in #770
  • Remove stop condition in gemini retry logic by @gabrielibagon in #769
  • Skip adding empty content from gemini by @gabrielibagon in #768
  • [BFCL] Add the option to log to WandB during bfcl evaluate by @ThomasRochefortB in #736
  • [BFCL] Add claude-3-5-haiku-20241022, claude-3-5-haiku-20241022-FC, claude-3-5-sonnet-20241022, claude-3-5-sonnet-20241022-FC by @HuanzhiMao in #750
  • [BFCL Dataset Revamp 4/n] Live Irrelevance by @Fanjia-Yan in #763
  • [BFCL Dataset Revamp 5/n] Multi-Turn Base WrapUp by @Fanjia-Yan in #772
  • [BFCL] Add Unit Test to Check for Illegal Python Parameter Name by @HuanzhiMao in #777
  • [BFCL] Dataset and Possible Answer Fix (Live Categories) for Illegal Python Parameter Name by @HuanzhiMao in #778
  • [BFCL] Add Support for Regeneration, Specific Test Entry IDs, and Custom Directory Locations by @Raymond112514 in #743
  • [BFCL] some tiny fix in possible_answer by @zhangch-ss in #786
  • [RAFT] Add link to Azure RAFT Distillation Recipe by @cedricvidal in #758
  • [BFCL] Add New Model Qwen/Qwen2.5-72B-Instruct by @HuanzhiMao in #787
  • [BFCL] Add DeepSeek-V2.5, DeepSeek-Coder-V2-Instruct-0724, DeepSeek-Coder-V2-Lite-Instruct, DeepSeek-V2-Chat-0628, DeepSeek-V2-Lite-Chat by @moonlight1431 in #697
  • Add minicpm3 4b FC model handler by @Cppowboy in #718
  • [BFCL] Add support for Writer models and Palmyra X 004 by @samjulien in #755
  • [BFCL Chore] Add @final and @overrides Decorators to Class Methods in Model Handler by @VishnuSuresh27 in #790
  • [BFCL Chore] Support Multiple Models and Test Category Input for BFCL CLI by @vsvaidya27 in #795
  • [BFCL] Fix Irrelevance Category Performance for DeepSeek Coder Handler by @HuanzhiMao in #796
  • [BFCL Chore] Quick fix change of decorators from @overrides to @override by @VishnuSuresh27 in #797
  • [BFCL Chore] Add Retry Mechanism with Backoff for Rate Limit Handling Across Proprietary Models by @HuanzhiMao in #781
  • [BFCL] Bug Fix for Execution_Result_Message Construction for Prompt Caching Feature in Claude Handler by @HuanzhiMao in #805
  • [BFCL Dataset Revamp 7/n] Augmented Multi-turn Dataset Fix by @Fanjia-Yan in #804
  • [BFCL Dataset Revamp 6/n] Live Relevance Data Fix by @Fanjia-Yan in #789
  • Add Weaviate APIs to Gorilla API Zoo by @CShorten in #783
  • [BFCL] Improve Latency Measurement Accuracy and Enable Default State Logging by @HuanzhiMao in #808
  • [BFCL] Replace 'class' with '_class' to Avoid Function Calling Formatting Error by @Fanjia-Yan in #811
  • [BFCL] Added Grok Handler by @amitojsingh2022 in #810
  • [BFCL] Resolve Issue in Gemini Model When No Model Output by @HuanzhiMao in #809
  • [BFCL] Add Amazon Models nova-pro-v1.0, nova-lite-v1.0, and nova-micro-v1.0 by @HuanzhiMao in #815
  • [BFCL Chore] Revamp README.md for Clearer Instructions by @HuanzhiMao in #819
  • [BFCL] Update gpt-4o Snapshot Version from 2024-08-06 to 2024-11-20 by @HuanzhiMao in #822
  • fix some enum type errors in datasets by @zhangch-ss in #826
  • Fix Merge Conflict From #826 by @HuanzhiMao in #829
  • [BFCL Chore] Add Unit Test for Valid Func Doc Format by @HuanzhiMao in #828
  • update hammer handler and add Hammer2.1 model by @linqq9 in #832
  • [BFCL] Add New Model Llama-3.3-70B-Instruct, Llama-3.3-70B-Instruct-FC by @HuanzhiMao in #837
  • [BFCL] Add o1-2024-12-17 and o1-2024-12-17-FC by @HuanzhiMao in #840
  • Add Cohere Command R7B, replace older Command R+ handler by @harry-cohere in #835
  • [BFCL Dataset] Ground Truth Error Fix by @Fanjia-Yan in #846
  • [BFCL] Add Qwen2.5-0.5B-Instruct, Qwen2.5-3B-Instruct, Qwen2.5-14B-Instruct, Qwen2.5-32B-Instruct by @HuanzhiMao in #842
  • [BFCL] Add New Model watt-tool-8B and watt-tool-70B by @zhanghanduo in #847
  • [BFCL] Skip Executable Categories When API Keys Missing by @HuanzhiMao in #848
  • [BFCL] Add gemini-2.0-flash-exp-FC, gemini-2.0-flash-exp, gemini-exp-1206-FC, gemini-exp-1206 by @HuanzhiMao in #843
  • Check and fix some parameter type errors in possible answers by @zhangch-ss in #838
  • [BFCL] Use N/A in Score Report for Unevaluated Categories by @HuanzhiMao in #849
  • [BFCL] possible answer fix: - reigion ->region by @sghyan16 in #852
  • [BFCL] Add Mistral Local Serving Handler and Add New Model mistralai/Ministral-8B-Instruct-2410 by @HuanzhiMao in #855
  • [BFCL] Add New Model DeepSeek-V3 by @HuanzhiMao in #857
  • [BFCL] Rename Directories: proprietary_model->api_inference, oss_model->local_inference for Better Clarity by @HuanzhiMao in #859
  • [BFCL] Support for pre-existing completion endpoint by @ThomasRochefortB in #864
  • [BFCL Chore] Ensure Correct Input Format for Eval Checker by @HuanzhiMao in #860

New Contributors

Full Changelog: v1.1...v1.2