ReasoningAgent benchmarking with SimpleBench #293
base: main
Conversation
Thanks. How about adding the test into the contrib-openai CI?
benchmark/run_simple-bench.py (outdated)

```python
if match:
    extracted_answer = match.group(1)
    results.append({"question_id": question_id, "answer": answer, "generated_answer": extracted_answer})
```
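For context, a self-contained sketch of the extraction step this excerpt belongs to. The regex pattern and function name below are illustrative assumptions, not the PR's actual code:

```python
import re

# Illustrative only: assumes the agent reports its choice as
# "Final Answer: <letter>"; the PR's actual pattern is not shown here.
ANSWER_PATTERN = re.compile(r"Final Answer:\s*([A-F])", re.IGNORECASE)

def extract_answer(response_text: str) -> str | None:
    """Return the answer letter parsed from the agent's response, if any."""
    match = ANSWER_PATTERN.search(response_text)
    return match.group(1).upper() if match else None
```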
Let's also cache the `ans` and `summary` for debugging purposes. Also, let's record the accuracy metrics.
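A minimal sketch of one way to address this. The names `ans` and `summary` follow the reviewer's wording and are assumptions about the script's actual variables:

```python
def record_result(results, question_id, answer, extracted_answer, ans, summary):
    """Append one benchmark record, caching the raw response and summary."""
    results.append({
        "question_id": question_id,
        "answer": answer,                      # ground-truth answer
        "generated_answer": extracted_answer,  # parsed from the agent output
        "ans": ans,                            # raw agent response, kept for debugging
        "summary": summary,                    # reasoning summary, kept for debugging
    })

def accuracy(results):
    """Fraction of questions where the extracted answer matches ground truth."""
    if not results:
        return 0.0
    correct = sum(r["generated_answer"] == r["answer"] for r in results)
    return correct / len(results)
```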
I've tried to cache the summary directly with the `results.json`; let me know if this works or if there's anything I'm missing.
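For reference, a sketch of what caching the summary alongside the per-question results in `results.json` might look like. The `{"summary": ..., "results": [...]}` layout is an assumed schema, not necessarily what the PR produces:

```python
import json

def save_results(results, summary, path="results.json"):
    """Write the run summary and per-question results to one JSON file."""
    with open(path, "w") as f:
        json.dump({"summary": summary, "results": results}, f, indent=2)
```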
LGTM. There are some small issues (a filename mismatch) that would prevent the code from running.
Can you please mention if it is for the ReasoningAgent or for the benchmark?
I mean, we can add a SimpleBench performance check as an optional CI for the ReasoningAgent. It's only triggered when necessary and requires approval.
Signed-off-by: Mark Sze <[email protected]>
I've tested with Anthropic, Gemini, and DeepSeek, and committed a summary file. The strongest results are from Anthropic (and Anthropic's chat UI scored the highest).
The numbers are really interesting. I was thinking let's not include the chat UI results in the benchmarking of the ReasoningAgent; let's restrict the results to only the reasoning agent? cc @marklysze @sonichi @BabyCNM Update: sorry, I misunderstood the results; I think the comparison looks amazing.
Sounds great, let me add the optional CI.
Signed-off-by: Mark Sze <[email protected]>
Why are these changes needed?
This is a draft PR for running SimpleBench with the ReasoningAgent; it is not meant to be merged.
source: https://simple-bench.com/
The benchmark result on the sample data (10 prompts) with gpt-4o-mini is 20%.
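For a quick sanity check, the headline number can be recomputed from the cached results file, assuming the `results.json` layout sketched earlier (20% on 10 prompts means 2 correct):

```python
import json

# Recompute accuracy from the cached file; 2 of 10 correct prints "20%".
with open("results.json") as f:
    data = json.load(f)

rows = data["results"]
correct = sum(r["generated_answer"] == r["answer"] for r in rows)
print(f"Accuracy: {correct}/{len(rows)} = {correct / len(rows):.0%}")
```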
Related issue number
Checks