Hi, I have read your paper carefully, but I'm still confused about how to evaluate your benchmark on other models. Could you share a concrete example of 'Listing 1: Tutor Evaluation Prompt'? For example, how is the full_conversation formatted, and what do airesponse and student_response refer to? A conversation has many student responses, so do you only evaluate the last one?
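To make my confusion concrete, here is roughly how I am currently trying to assemble the prompt. The turn formatting, the field names, and the choice of which turns map to airesponse / student_response are my own guesses, not something I could find in the paper:

```python
# My current guess at how Listing 1 might be filled in -- please correct me
# if this is not what you intend. The dialogue rendering and the mapping of
# ai_response / student_response to specific turns are assumptions on my part.

def build_tutor_eval_prompt(conversation, prompt_template):
    """conversation: list of {"role": "student" | "tutor", "text": ...} turns."""
    # Assumption: the history is rendered as alternating labeled turns.
    full_conversation = "\n".join(
        f"{turn['role'].capitalize()}: {turn['text']}" for turn in conversation
    )
    # Assumption: ai_response is the tutor's final reply (the turn being judged),
    # and student_response is the last student turn it responds to.
    ai_response = next(t["text"] for t in reversed(conversation) if t["role"] == "tutor")
    student_response = next(t["text"] for t in reversed(conversation) if t["role"] == "student")
    return prompt_template.format(
        full_conversation=full_conversation,
        ai_response=ai_response,
        student_response=student_response,
    )
```

Is this close to how you run the evaluation, or is every tutor turn in the conversation scored separately?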