Hi, I have read your paper carefully, but I'm still confused about how to evaluate your benchmark on other models. Could you share a concrete example of 'Listing 1: Tutor Evaluation Prompt'? For example, how is the full_conversation formatted, and what do airesponse and student_response refer to? A conversation has many student responses, so do you only evaluate the last one?
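To make my confusion concrete, here is roughly how I am currently trying to assemble the prompt. The turn formatting, the field names, and the choice of which turns map to airesponse / student_response are my own guesses, not something I could find in the paper:

```python
# My current guess at how Listing 1 might be filled in -- please correct me
# if this is not what you intend. The dialogue rendering and the mapping of
# ai_response / student_response to specific turns are assumptions on my part.

def build_tutor_eval_prompt(conversation, prompt_template):
    """conversation: list of {"role": "student" | "tutor", "text": ...} turns."""
    # Assumption: the history is rendered as alternating labeled turns.
    full_conversation = "\n".join(
        f"{turn['role'].capitalize()}: {turn['text']}" for turn in conversation
    )
    # Assumption: ai_response is the tutor's final reply (the turn being judged),
    # and student_response is the last student turn it responds to.
    ai_response = next(t["text"] for t in reversed(conversation) if t["role"] == "tutor")
    student_response = next(t["text"] for t in reversed(conversation) if t["role"] == "student")
    return prompt_template.format(
        full_conversation=full_conversation,
        ai_response=ai_response,
        student_response=student_response,
    )
```

Is this close to how you run the evaluation, or is every tutor turn in the conversation scored separately?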