U-MATH and $\mu$-MATH evaluation code

This repository contains the official evaluation code for the U-MATH and $\mu$-MATH benchmarks. These datasets are designed to test the mathematical reasoning and meta-evaluation capabilities of Large Language Models (LLMs) on university-level problems.

Overview

U-MATH provides a set of 1,100 university-level mathematical problems, while μ-MATH complements it with a meta-evaluation framework focused on solution judgment, comprising 1,084 LLM-generated solutions.

U-MATH Evaluation Results

[Figure: U-MATH results table]
[Figure: U-MATH results bar chart]

$\mu$-MATH Evaluation Results

[Figure: μ-MATH results table]
[Figure: μ-MATH results scatter plot]

Structure and Usage

This repository provides scripts for solving and evaluating the U-MATH and μ-MATH datasets.

File Structure

  • solve_u_math.py: Script to generate solutions for U-MATH problems using an OpenAI-compatible endpoint (e.g., gpt-4o or a vLLM server).
  • judge_u_math.py: Script to evaluate the correctness of U-MATH solutions.
  • judge_mu_math.py: Script to evaluate the quality of LLM judgments for μ-MATH solutions.
  • README.md: This file.
  • requirements.txt: List of dependencies required for running the scripts.

Download the repository and install the dependencies:

git clone https://github.com/toloka/u-math.git
cd u-math
pip install -r requirements.txt

Solve U-MATH Problems

To generate solutions for U-MATH problems, run the following command:

python solve_u_math.py --base_url <BASE_URL> --api_key <YOUR_API_KEY> --model <MODEL_NAME> --output_file predictions_u_math.json
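The script targets any OpenAI-compatible chat-completions endpoint. As a rough sketch, the request body it sends would look something like the following; the placeholder values, the prompt, and the sampling parameters here are illustrative assumptions, not necessarily what solve_u_math.py actually uses:

```python
import json

# Illustrative request body for an OpenAI-compatible /v1/chat/completions
# endpoint. The actual prompt template and parameters in solve_u_math.py
# may differ.
problem_text = "Compute the derivative of x^2 * sin(x)."  # stand-in problem
payload = {
    "model": "<MODEL_NAME>",  # same value as the --model flag
    "messages": [{"role": "user", "content": problem_text}],
    "temperature": 0.0,  # assumed: greedy decoding for reproducibility
}
print(json.dumps(payload, indent=2))
```

This same request shape works against both hosted APIs (with --base_url pointing at the provider) and a locally served model behind vLLM's OpenAI-compatible server.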

Judge U-MATH Solutions

To evaluate the correctness of U-MATH solutions, run the following command:

python judge_u_math.py --base_url <BASE_URL> --api_key <YOUR_API_KEY> --model <MODEL_NAME> --predictions_file predictions_u_math.json --output_file judgments_u_math.json
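Once judgments are written out, an aggregate score can be computed from the output file. A minimal sketch, assuming each judgment record carries a boolean `correct` field; the actual schema of judgments_u_math.json may differ:

```python
import json  # needed if loading the real judgments file

def accuracy(judgments):
    """Fraction of solutions the judge marked correct."""
    if not judgments:
        return 0.0
    return sum(1 for j in judgments if j.get("correct") is True) / len(judgments)

# Stand-in for: judgments = json.load(open("judgments_u_math.json"))
sample = [
    {"problem_id": 1, "correct": True},
    {"problem_id": 2, "correct": False},
    {"problem_id": 3, "correct": True},
    {"problem_id": 4, "correct": True},
]
print(accuracy(sample))  # 0.75
```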

Evaluate Judge on μ-MATH

To evaluate the quality of LLM judgments for μ-MATH solutions, run the following command:

python judge_mu_math.py --base_url <BASE_URL> --api_key <YOUR_API_KEY> --model <MODEL_NAME> --output_file judgments_mu_math.json
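Meta-evaluation compares the judge's verdicts against μ-MATH's gold correctness labels. A hedged sketch of one such comparison using simple agreement; the field names `judge_label` and `gold_label` are assumptions, and the benchmark's official metric may be computed differently:

```python
def agreement(records):
    """Fraction of records where the judge's verdict matches the gold label."""
    if not records:
        return 0.0
    matches = sum(1 for r in records if r["judge_label"] == r["gold_label"])
    return matches / len(records)

# Illustrative records: one judge verdict vs. gold label per LLM solution.
sample = [
    {"judge_label": True, "gold_label": True},
    {"judge_label": False, "gold_label": True},
    {"judge_label": False, "gold_label": False},
    {"judge_label": True, "gold_label": True},
]
print(agreement(sample))  # 0.75
```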

Licensing Information

  • The contents of μ-MATH's machine-generated model_output column are subject to the underlying LLMs' licensing terms.
  • All other U-MATH and μ-MATH dataset fields, as well as the code, are available under the MIT license.

Citation

If you use U-MATH or μ-MATH in your research, please cite the paper:

@inproceedings{umath2024,
  title={U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs},
  author={Chernyshev, Konstantin and Polshkov, Vitaliy and Artemova, Ekaterina and Myasnikov, Alex and Stepanov, Vlad and Miasnikov, Alexei and Tilga, Sergei},
  year={2024}
}

Contact

For inquiries, please contact [email protected]
