Fixed `EVAL_PROMPT_TEMPLATE` to handle empty string or multiple match answers #724

jamesbraza · 2024-11-26T19:05:15Z

This PR updates the evaluation prompt template to handle:

- Empty strings, which relates to "Empty session.answer for exhaustive search without an answer" (#723)
- Multiple matches, which fixes "Failures to correct evaluate non-single letter answers" (#709)

sidnarayanan

How do the benchmarks move with this change?

mskarlin · 2024-11-26T19:23:42Z

paperqa/prompts.py

@@ -98,9 +98,12 @@
 # Prompt templates for use with LitQA
 QA_PROMPT_TEMPLATE = "Q: {question}\n\nOptions:\n{options}"
 EVAL_PROMPT_TEMPLATE = (
-    "Extract the single letter answer from the following question and answer"


I think this is a solid iteration of the eval prompt template to address #709.

But I don't think this addresses "Empty session.answer for exhaustive search without an answer" (#723).

We still need the ability to inform the user of what happened at this step -- they just get a blank answer object currently. If the "user" is the eval script (as here) then this takes care of it.

Oh, so you'd like to see something like the paper-qa runner script swapping out an empty answer for some sentinel value? Can you elaborate on where in the stack you'd like to see a fix, beyond LitQAEvaluation?

Yeah, exactly -- after the runner finishes, we need some sentinel to be added. The blank isn't intuitive enough for users to understand (especially via command line or UI). Can def be another PR.

In cases like this eval though -- the sentinel will create a need for another prompt, so we could make the sentinel a module level variable and check for it in the eval method.

I see what you're saying. Am I correct that you took care of this in #726?

Regardless, this PR will no longer close #723.
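
For reference, the sentinel idea from this thread could look roughly like the sketch below. It is only an illustration under assumptions: `EMPTY_ANSWER_SENTINEL`, `finalize_answer`, and `needs_llm_extraction` are hypothetical names, not identifiers from paper-qa or from #726.

```python
# Hypothetical sketch: none of these names are confirmed paper-qa identifiers.

# Module-level sentinel, so both the runner and the eval can refer to it.
EMPTY_ANSWER_SENTINEL = "No answer was generated."


def finalize_answer(raw_answer: str) -> str:
    """Swap a blank (or whitespace-only) answer for the sentinel after the runner finishes."""
    return raw_answer if raw_answer.strip() else EMPTY_ANSWER_SENTINEL


def needs_llm_extraction(answer: str) -> bool:
    """Return False for the sentinel, so the eval method can skip its extraction prompt."""
    return answer != EMPTY_ANSWER_SENTINEL
```

With something like this, a user on the command line or UI would see an explicit message instead of a blank, and the eval could check for the sentinel directly rather than needing another prompt.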

jamesbraza · 2024-11-26T19:30:54Z

> How do the benchmarks move with this change?

I am going to be running this shortly. With the empty answers, previously sometimes they got marked as correct, sometimes incorrect, and sometimes unsure. I am not sure which is more frequent, but that will likely dictate the impact on benchmark performance.

Fwiw our mini test suite's benchmark has grown here.
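
As a rough way to see which outcome was most frequent for empty answers, one could tally past evaluation results. This is a minimal sketch that assumes results are available as `(answer, outcome)` pairs; the record shape and the function name `tally_empty_answer_outcomes` are hypothetical.

```python
from collections import Counter
from collections.abc import Iterable


def tally_empty_answer_outcomes(records: Iterable[tuple[str, str]]) -> Counter:
    """Count evaluation outcomes, considering only records whose answer is blank."""
    return Counter(outcome for answer, outcome in records if not answer.strip())


# Made-up records for illustration only, not real benchmark data.
example_records = [("", "correct"), ("B", "correct"), ("  ", "unsure"), ("", "incorrect")]
empty_tally = tally_empty_answer_outcomes(example_records)
```

The dominant key in the resulting counter would indicate which direction the benchmark is likely to move once empty answers are graded consistently.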

jamesbraza · 2024-11-27T02:12:16Z

@sidnarayanan I will circle back with the impact on evaluation; going to merge this for now.

jamesbraza added 3 commits November 26, 2024 13:37

Named from_question cases, and remade cassettes

1623082

Added LitQA question case

fda92f2

Updated prompt to fix issues and updated cassettes

1b9f27f

jamesbraza added the bug Something isn't working label Nov 26, 2024

jamesbraza requested review from whitead, sidnarayanan, mskarlin, maykcaldas and nadolskit November 26, 2024 19:05

jamesbraza self-assigned this Nov 26, 2024

dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Nov 26, 2024

jamesbraza mentioned this pull request Nov 26, 2024

Fixing LitQAEvaluation bugs: incorrect reward indices, not using LLM's native knowledge #708

Merged

sidnarayanan reviewed Nov 26, 2024


mskarlin reviewed Nov 26, 2024


mskarlin approved these changes Nov 26, 2024


dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 26, 2024

jamesbraza merged commit 31f9825 into main Nov 27, 2024
3 of 5 checks passed

jamesbraza deleted the better-evals branch November 27, 2024 02:12
