[Bug/Assistance] - Reproducing Results on Alfworld (HH) (vs. ReAct paper) #127
Comments
Please read the paper carefully. You can find all the prompts in the appendix or the code. The results are different because: 1. we are not using the same prompt; 2. we are not using exactly the same environment.
Thanks for coming back @zhc7.
The reason for asking this question is to understand whether you were able to get close to the results reported in ReAct, and what the exact difference might be, as the ReAct results seem nearly impossible to reproduce.
Hi @ai-nikolai, sorry for the late reply; we've been quite busy lately. To answer your question, I believe the main difference is the prompting technique. We weren't aiming to reproduce ReAct's results, but to design a prompt and an evaluation process that are relatively fair to all the models. The prompt we used is listed in Appendix G of the paper. The evaluation process is located at AgentBench/src/server/tasks/alfworld/task.py Line 105 in 2f3c343
The main differences are in adapting ALFWorld to the framework and setting some limitations and rules to avoid prolonged evaluation. To sum up, you may have to investigate this problem further.
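To illustrate why limits that avoid prolonged evaluation can move a success rate on their own, here is a minimal toy sketch. It is not AgentBench's or ReAct's actual code: `ToyEpisode`, `run_episode`, `success_rate`, and the `max_rounds` parameter are hypothetical names, and the step counts are made-up illustration values, not measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyEpisode:
    """Hypothetical stand-in for one task: solved once `steps_needed` rounds have run."""
    steps_needed: int


def run_episode(episode: ToyEpisode, max_rounds: int) -> bool:
    """Return True if the episode is solved within the round budget (assumed rule)."""
    for round_idx in range(1, max_rounds + 1):
        if round_idx >= episode.steps_needed:
            return True
    # Budget exhausted: the episode counts as a failure.
    return False


def success_rate(episodes: List[ToyEpisode], max_rounds: int) -> float:
    """Fraction of episodes solved under the given round cap."""
    solved = sum(run_episode(ep, max_rounds) for ep in episodes)
    return solved / len(episodes)


# Made-up step counts, for illustration only.
episodes = [ToyEpisode(n) for n in (5, 12, 25, 40)]
print(success_rate(episodes, max_rounds=20))  # 0.5
print(success_rate(episodes, max_rounds=50))  # 1.0
```

The same set of episodes scores differently under the two caps, so when comparing numbers across papers it may be worth checking the round limit and termination rules in addition to the prompt.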
Bug / Assistance Description
The results that are reported in the HH column are very different to the ReAct paper. In particular, ReAct reports
To Reproduce
See screenshots below. Your results in the HH column indicate 16% success for text-davinci-002 or gpt-3.5-turbo. However, the ReAct results using text-davinci-002 indicate 78% (second screenshot). This is a significant difference.
Screenshots or Terminal Copy&Paste
Concrete Questions / Actions:
Please tell us: