ReasoningAgent benchmarking with SimpleBench #293
base: main
Conversation
Thanks. How about adding the test into the contrib-openai CI?
benchmark/run_simple-bench.py (outdated)

```python
if match:
    extracted_answer = match.group(1)
    results.append({"question_id": question_id, "answer": answer, "generated_answer": extracted_answer})
```
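For context, a self-contained sketch of the extraction step this excerpt belongs to. The regex pattern and function name below are illustrative assumptions, not the PR's actual code:

```python
import re

# Illustrative only: assumes the agent reports its choice as
# "Final Answer: <letter>"; the PR's actual pattern is not shown here.
ANSWER_PATTERN = re.compile(r"Final Answer:\s*([A-F])", re.IGNORECASE)

def extract_answer(response_text: str) -> str | None:
    """Return the answer letter parsed from the agent's response, if any."""
    match = ANSWER_PATTERN.search(response_text)
    return match.group(1).upper() if match else None
```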
Let's also cache the `ans` and `summary` for debugging purposes. Also, let's record the accuracy metrics.
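A minimal sketch of one way to address this. The names `ans` and `summary` follow the reviewer's wording and are assumptions about the script's actual variables:

```python
def record_result(results, question_id, answer, extracted_answer, ans, summary):
    """Append one benchmark record, caching the raw response and summary."""
    results.append({
        "question_id": question_id,
        "answer": answer,                      # ground-truth answer
        "generated_answer": extracted_answer,  # parsed from the agent output
        "ans": ans,                            # raw agent response, kept for debugging
        "summary": summary,                    # reasoning summary, kept for debugging
    })

def accuracy(results):
    """Fraction of questions where the extracted answer matches ground truth."""
    if not results:
        return 0.0
    correct = sum(r["generated_answer"] == r["answer"] for r in results)
    return correct / len(results)
```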
I've tried to cache the summary directly with the `results.json`; let me know if this works or if there's anything I'm missing.
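For reference, a sketch of what caching the summary alongside the per-question results in `results.json` might look like. The `{"summary": ..., "results": [...]}` layout is an assumed schema, not necessarily what the PR produces:

```python
import json

def save_results(results, summary, path="results.json"):
    """Write the run summary and per-question results to one JSON file."""
    with open(path, "w") as f:
        json.dump({"summary": summary, "results": results}, f, indent=2)
```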
LGTM. There are some small issues (a filename mismatch) that would prevent the code from running.
Can you please mention if it is for the ReasoningAgent or for the benchmark?
I mean, we can add a SimpleBench performance check as an optional CI for the ReasoningAgent. It's only triggered when necessary and requires approval.
Signed-off-by: Mark Sze <[email protected]>
I've tested with Anthropic, Gemini, and DeepSeek, and committed a summary file. The strongest results are from Anthropic (and Anthropic's chat UI scored the highest).
The numbers are really interesting. I was thinking let's not include the chat UI results in the benchmarking of the ReasoningAgent; let's restrict the results to only the reasoning agent? cc @marklysze @sonichi @BabyCNM Update: sorry, I misunderstood the results; I think the comparison looks amazing.
Sounds great, let me add the optional CI.
Signed-off-by: Mark Sze <[email protected]>
Why are these changes needed?
This is a draft PR for running SimpleBench with the ReasoningAgent; it is not meant to be merged.
source: https://simple-bench.com/
The benchmark result on the sample data (10 prompts) with gpt-4o-mini is 20%.
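For a quick sanity check, the headline number can be recomputed from the cached results file, assuming the `results.json` layout sketched earlier (20% on 10 prompts means 2 correct):

```python
import json

# Recompute accuracy from the cached file; 2 of 10 correct prints "20%".
with open("results.json") as f:
    data = json.load(f)

rows = data["results"]
correct = sum(r["generated_answer"] == r["answer"] for r in rows)
print(f"Accuracy: {correct}/{len(rows)} = {correct / len(rows):.0%}")
```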
Related issue number
Checks