Great work on tool use!
However, I have some questions about the results, and I would be grateful for a reply.
1. In `StableToolBench`, the pass rate results at https://zhichengg.github.io/stb.github.io/ show gpt3.5 > gpt4 with DFS, while in `ToolBench` the result is gpt4 > gpt3.5 (both ReAct and DFS). Is there any analysis of this result?
2. Large difference from the paper's reported results.

Below are my rerun results for pass rate.

- `gpt-4-turbo-preview_cot(report on github)` is the result reported in the paper.
- `gpt-4-turbo-preview_cot(based data_baselines rerun)`: I first downloaded the inference answers from https://huggingface.co/datasets/stabletoolbench/baselines, then computed the pass rate on the `gpt-4-turbo-preview_cot` directory with `gpt-4-turbo-2024-04-09` as the eval model. The result differs substantially from the report.
- `gpt-4-turbo-2024-04-09_cot`: I ran inference via the script `inference_chatgpt_pipeline_virtual.sh` with GPT_MODEL set to `gpt-4-turbo-2024-04-09`. The eval model is also `gpt-4-turbo-2024-04-09`.
- `gpt-4-turbo-2024-04-09_cot_rerun` is the same as `gpt-4-turbo-2024-04-09_cot` but run again, which shows that the eval pass rate is stable.
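
One way to put a number on "stable" is to compare the per-query pass labels of the two runs. This is only a minimal sketch: it assumes each evaluation run can be reduced to a mapping from query ID to a pass/fail boolean. The function names and the sample labels below are hypothetical and made up for illustration. They are not part of StableToolBench and are not my actual results.

```python
from typing import Dict


def pass_rate(labels: Dict[str, bool]) -> float:
    """Return the fraction of queries labeled as passed."""
    if not labels:
        raise ValueError("no labels given")
    return sum(labels.values()) / len(labels)


def agreement(run_a: Dict[str, bool], run_b: Dict[str, bool]) -> float:
    """Return the fraction of shared queries that got the same label in both runs."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        raise ValueError("the two runs share no query IDs")
    return sum(run_a[q] == run_b[q] for q in shared) / len(shared)


# Hypothetical labels for illustration only (not real evaluation output).
run_1 = {"q1": True, "q2": False, "q3": True, "q4": True}
run_2 = {"q1": True, "q2": False, "q3": False, "q4": True}

print(f"run 1 pass rate: {pass_rate(run_1):.2f}")
print(f"run 2 pass rate: {pass_rate(run_2):.2f}")
print(f"label agreement: {agreement(run_1, run_2):.2f}")
```

Reporting label agreement alongside the two pass rates helps because two runs can reach the same aggregate pass rate while disagreeing on individual queries.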