Human evaluation results from Google Sheet not reproducible? #6

nils-hde · 2023-08-11T09:22:29Z

I am wondering how exactly the human evaluation scores in this sheet were computed: https://docs.google.com/spreadsheets/d/1THEh9MRPWQCC1v4DH5WTw0Gq8TyV9zncWWUL08drtUY/edit#gid=452616194

For reference, here is what we end up with (rightmost column) when taking the results from the current master branch. Note also that Team 7 is missing entirely: https://docs.google.com/spreadsheets/d/1oEtzLyouTR-numPKS4WtMPSQTD6m9IzutXbtGwNNY5A/edit#gid=452616194

Both the absolute values and the rankings differ. We compute the average human score over all generated responses and then multiply it by the Detection F1-Score as provided in the paper.
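To make the computation concrete, here is a minimal sketch of what we do. The function names, the input layout, and the toy numbers in the usage note are placeholders of ours, not taken from the repository or the sheet; whether the official sheet aggregates the same way is exactly what we are asking.

```python
from statistics import mean


def final_score(response_scores, detection_f1):
    """Average the per-response human scores, then scale by the Detection F1-Score.

    `response_scores` is assumed to be a flat list of human ratings, one per
    generated response; `detection_f1` is the value reported in the paper.
    """
    if not response_scores:
        raise ValueError("no response scores given")
    return mean(response_scores) * detection_f1


def rank_teams(scores_by_team, f1_by_team):
    """Return (team, final score) pairs sorted from best to worst.

    Teams without a Detection F1-Score are skipped rather than guessed at.
    """
    results = [
        (team, final_score(scores, f1_by_team[team]))
        for team, scores in scores_by_team.items()
        if team in f1_by_team
    ]
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

With made-up inputs such as `rank_teams({"A": [1, 2, 3], "B": [2, 2, 2]}, {"A": 0.5, "B": 0.9})`, team B ranks ahead of team A. If the official scores use a different aggregation (for example, averaging per dialogue before averaging overall), that would explain the mismatch.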
