
ReasoningAgent benchmarking with SimpleBench #293

Draft · wants to merge 7 commits into base: main
Conversation


@Hk669 Hk669 commented Dec 26, 2024

Why are these changes needed?

A draft PR for running SimpleBench with the ReasoningAgent; this PR is not meant to be merged.
source: https://simple-bench.com/

The benchmark result on the sample data (10 prompts) with gpt-4o-mini is 20%.
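For reference, the reported accuracy could be computed from a cached results.json along these lines. This is a hypothetical sketch; the field names (`answer`, `generated_answer`) are assumptions, not necessarily the PR's actual schema:

```python
import json

def accuracy(results_path):
    """Hypothetical helper: percent of questions where the extracted
    answer matches the reference answer. Field names are assumptions."""
    with open(results_path) as f:
        results = json.load(f)
    correct = sum(1 for r in results if r["generated_answer"] == r["answer"])
    return 100.0 * correct / len(results)
```

With 2 correct answers out of 10 prompts, this would report 20.0.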



sonichi commented Dec 26, 2024

Thanks. How about adding the test into the contrib-openai CI?


```python
if match:
    extracted_answer = match.group(1)
    results.append({"question_id": question_id, "answer": answer, "generated_answer": extracted_answer})
```
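For context, a runnable sketch of what the surrounding extraction loop might look like. The regex pattern, field names, and helper structure here are assumptions for illustration, not the PR's actual code:

```python
import re

# Hypothetical pattern: assumes the agent ends its reply with
# "Final Answer: <letter>". This is an assumption, not the PR's regex.
ANSWER_RE = re.compile(r"Final Answer:\s*([A-F])", re.IGNORECASE)

def extract_answers(raw_outputs):
    """raw_outputs: list of dicts with 'question_id', 'answer' (reference),
    and 'generated_text' (the agent's full reply). Field names are assumptions."""
    results = []
    for item in raw_outputs:
        match = ANSWER_RE.search(item["generated_text"])
        if match:
            extracted_answer = match.group(1)
            results.append({
                "question_id": item["question_id"],
                "answer": item["answer"],
                "generated_answer": extracted_answer,
            })
    return results
```

Questions where no answer can be extracted are simply skipped here; a real run would likely want to count those as failures instead.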
Let's also cache the answer and summary for debugging purposes.

Also, let's record the accuracy metrics.

Hk669 (author) replied:

I've tried to cache the summary directly in results.json; let me know if this works or if there's anything I'm missing.

@BabyCNM BabyCNM left a comment


LGTM. There are some small issues (a filename mismatch) that would prevent the code from running.

@Hk669 Hk669 marked this pull request as ready for review January 1, 2025 10:55

Hk669 commented Jan 1, 2025

> Thanks. How about adding the test into the contrib-openai CI?

Can you please mention whether it is for the ReasoningAgent or for the benchmark?
FYI: the CI tests for the ReasoningAgent are in progress in PR #294.

@Hk669 Hk669 requested a review from BabyCNM January 1, 2025 10:58

sonichi commented Jan 1, 2025

> Thanks. How about adding the test into the contrib-openai CI?

> can you please mention if it is for the ReasoningAgent or for the Benchmark? fyi: the ci tests for the reasoningagent are under process in the PR #294

I mean, we can add a SimpleBench performance check as an optional CI job for the ReasoningAgent. It's only triggered when necessary and requires approval.
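For illustration, a manually triggered, approval-gated job like the one described could be sketched with a `workflow_dispatch` trigger and a protected environment. This YAML is a hypothetical sketch; the workflow name, script path, environment name, and secret name are all assumptions:

```yaml
# Hypothetical sketch; names, paths, and secrets are assumptions.
name: SimpleBench-ReasoningAgent
on:
  workflow_dispatch:  # only runs when triggered manually
jobs:
  simplebench:
    runs-on: ubuntu-latest
    environment: openai  # approval required via environment protection rules
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e .
      - run: python run_simplebench.py  # hypothetical benchmark script
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

The `workflow_dispatch` trigger keeps the benchmark out of the regular PR pipeline, and the environment protection rule is what enforces the "requires approval" behavior.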

@marklysze

I've tested with Anthropic, Gemini, and DeepSeek, and committed a summary file.

See here.

The strongest results are from Anthropic (and Anthropic's chat UI scored the highest).


Hk669 commented Jan 2, 2025

> I've tested with Anthropic, Gemini, DeepSeek, committed a summary file.
>
> See here.
>
> The strongest results are from Anthropic (and Anthropic's chat UI scored the highest).

The numbers are really interesting. I was thinking we should not include the chat UI results in the benchmarking of the ReasoningAgent; shall we restrict the results to the reasoning agent only?

cc @marklysze @sonichi @BabyCNM

Update: sorry, I misunderstood the results; I think the comparison looks great.


Hk669 commented Jan 2, 2025

> I mean, we can add simplebench performance check as an optional CI for reasoning agent. It's only triggered when necessary and requires approval.

Sounds great, let me add the optional CI.
