Thank you for your great work!

I just ran the evaluation pipeline and checked the pass rates for toolllama v2, gpt-3.5-turbo, and gpt-4-turbo. However, all of the pass rates are significantly lower than the scores reported in the experiments.

I have confirmed that gpt-4-turbo is used both on the server and during evaluation. Are there any considerations to keep in mind during inference in order to reproduce the reported results?

I am also curious whether specific hyperparameters were used to obtain results close to those in the experiments. (I would expect an error margin of up to about 5% when reproducing them.)
We are currently experiencing two issues that may cause the reproducibility problem:
Firstly, the real API server maintained by the ToolBench team is facing instability problems. Many calls to real APIs return 500 errors, as reported by other users. We are investigating this and hope to fix it soon. You can double-check your replicated trajectories to see whether you are affected by this problem.
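If it helps, here is a minimal sketch for scanning saved trajectories for suspected 500 responses. The directory path and JSON field names (`answer_details`, `observation`) are assumptions about your output layout, not part of the official pipeline, so adjust them to match your own files:

```python
import json
from pathlib import Path

TRAJ_DIR = Path("data/answer/your_model")  # hypothetical location of your replicated trajectories


def count_http_500(traj_dir: Path) -> int:
    """Count trajectory steps whose observation looks like an HTTP 500 response."""
    failures = 0
    for path in traj_dir.rglob("*.json"):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        # Assumed structure: a list of steps, each carrying an "observation" string.
        for step in record.get("answer_details", []):
            obs = str(step.get("observation", ""))
            if "500" in obs and "error" in obs.lower():
                failures += 1
    return failures


if __name__ == "__main__":
    print(f"Suspected 500 responses: {count_http_500(TRAJ_DIR)}")
```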
Secondly, OpenAI updated their gpt-4-turbo models this month. With the new model as the evaluator, performance drops systematically. We used gpt-4-turbo-preview in our experiments, but the behaviour of that model has also changed considerably. We will soon update the reported model performance with gpt-4-turbo-2024-04-09 and publish our model inference results. We are also training our own evaluator on top of an open-source model to replace these closed-source models.
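In the meantime, if you want to reduce drift from the moving gpt-4-turbo alias, one option is to pin the evaluator to the dated snapshot. A rough sketch assuming the judge calls go through the official openai Python client (the prompt content below is placeholder text, not the actual evaluator prompt):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Dated snapshot instead of the moving "gpt-4-turbo" alias.
EVALUATOR_MODEL = "gpt-4-turbo-2024-04-09"

response = client.chat.completions.create(
    model=EVALUATOR_MODEL,
    temperature=0,  # keep judging as deterministic as possible
    messages=[
        {"role": "system", "content": "You are an evaluator of tool-use trajectories."},
        {"role": "user", "content": "<task, trajectory, and final answer to be judged>"},
    ],
)
print(response.choices[0].message.content)
```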