
Confusion on how to evaluate models using this benchmark. #2

Open
shiwk20 opened this issue Jul 15, 2024 · 0 comments

Comments


shiwk20 commented Jul 15, 2024

Hi, I have read your paper carefully, but I'm still confused about how to evaluate other models on your benchmark. Could you share a concrete example of 'Listing 1: Tutor Evaluation Prompt'? For instance, how is `full_conversation` formatted, and what do `airesponse` and `student_response` refer to? A conversation contains many student responses; do you evaluate only the last one?
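
To make the question concrete, here is a minimal sketch of how I am currently filling in the template. The placeholder names (`full_conversation`, `airesponse`, `student_response`) are taken from Listing 1, but the turn formatting and the decision to evaluate only the final exchange are my own guesses:

```python
# A minimal sketch of how I am currently assembling the prompt.
# The placeholder names come from Listing 1; the turn formatting and
# the choice to score only the final exchange are my own assumptions.

def format_conversation(turns):
    """Render (role, text) turns as plain text, one turn per line."""
    return "\n".join(f"{role}: {text}" for role, text in turns)

turns = [
    ("Student", "Why does my loop never terminate?"),
    ("Tutor", "What does the loop condition check?"),
    ("Student", "Oh, I think i never gets incremented."),
    ("Tutor", "Exactly. Where would you add the increment?"),
]

# Assumption: the history excludes the final student/tutor exchange,
# which is passed separately as student_response and airesponse.
full_conversation = format_conversation(turns[:-2])
student_response = turns[-2][1]  # the student turn being responded to
airesponse = turns[-1][1]        # the tutor (AI) reply being evaluated

prompt = (
    f"Conversation history:\n{full_conversation}\n\n"
    f"Student response: {student_response}\n"
    f"AI response: {airesponse}\n"
    "Evaluate the AI tutor's response."
)
print(prompt)
```

If the template instead expects the full history including the final exchange, or a separate evaluation pass per student turn, a pointer to the exact formatting code would clear this up.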
