gpt3.5 > gpt4 on pass rate? #14

stanpcf · 2024-06-24T09:20:19Z

Great work on tool use!
However, I have some questions about the results, and I would be grateful for a reply.

1. In `StableToolBench`, the pass rate results at https://zhichengg.github.io/stb.github.io/ show gpt3.5 > gpt4 with DFS. Is there any analysis of this result?

In `ToolBench`, by contrast, gpt4 > gpt3.5 (for both ReAct and DFS).

2. My results differ substantially from those reported in the paper.

Below are my rerun results for pass rate.

`gpt-4-turbo-preview_cot` (reported on GitHub): the result reported in the paper.
`gpt-4-turbo-preview_cot` (rerun from data_baselines): I first downloaded the inference answers from https://huggingface.co/datasets/stabletoolbench/baselines, then computed the pass rate on the `gpt-4-turbo-preview_cot` directory with gpt-4-turbo-2024-04-09 as the eval model. This differs substantially from the reported result.
`gpt-4-turbo-2024-04-09_cot`: inference run via the script `inference_chatgpt_pipeline_virtual.sh` with GPT_MODEL set to gpt-4-turbo-2024-04-09; the eval model is also gpt-4-turbo-2024-04-09.
`gpt-4-turbo-2024-04-09_cot_rerun`: the same as `gpt-4-turbo-2024-04-09_cot` but run again, which shows that the evaluated pass rate is stable.


stanpcf · 2024-06-25T12:35:13Z

In the baselines dataset (https://huggingface.co/datasets/stabletoolbench/baselines), the file `data_baselines/gpt-4-turbo-preview_dfs/G1_instruction/4505_DFS_woFilter_w2.json` is broken.
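
To check whether other files have the same problem, something like the following could be used. This is only a minimal sketch: it assumes "broken" means the file fails to parse as JSON and that the dataset has been downloaded into a local `data_baselines` directory. `find_broken_json` is a hypothetical helper written for this comment, not part of StableToolBench.

```python
import json
from pathlib import Path


def find_broken_json(root):
    """Return (path, error message) pairs for *.json files under root that fail to parse.

    Hypothetical helper: "broken" is assumed to mean "not valid JSON".
    """
    broken = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            with path.open(encoding="utf-8") as f:
                json.load(f)
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            # Truncated or corrupted files end up here.
            broken.append((str(path), str(exc)))
    return broken
```

Calling `find_broken_json("data_baselines")` returns a list of the files that could not be parsed, each with its parser error, so they can be skipped or downloaded again before running the pass rate evaluation.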
