U-MATH and $\mu$-MATH evaluation code

This repository contains the official evaluation code for the U-MATH and $\mu$-MATH benchmarks. These datasets are designed to test the mathematical reasoning and meta-evaluation capabilities of Large Language Models (LLMs) on university-level problems.

Overview

U-MATH provides a set of 1,100 university-level mathematical problems, while μ-MATH complements it with a meta-evaluation framework focused on solution judgment, comprising 1,084 LLM-generated solutions.

U-MATH Evaluation Results

[Figure: U-MATH results table]
[Figure: U-MATH results bar chart]

$\mu$-MATH Evaluation Results

[Figure: μ-MATH results table]
[Figure: μ-MATH results scatter plot]

Structure and Usage

This repository provides scripts for solving and evaluating the U-MATH and μ-MATH datasets.

File Structure

  • solve_u_math.py: Script to generate solutions for U-MATH problems using an OpenAI-compatible endpoint (e.g., gpt-4o or a vLLM server).
  • judge_u_math.py: Script to evaluate the correctness of U-MATH solutions.
  • judge_mu_math.py: Script to evaluate the quality of LLM judgments for μ-MATH solutions.
  • README.md: This file.
  • requirements.txt: List of dependencies required for running the scripts.

Download the repository and install the dependencies:

git clone https://github.com/toloka/u-math.git
cd u-math
pip install -r requirements.txt

Solve U-MATH Problems

To generate solutions for U-MATH problems, run the following command:

python solve_u_math.py --base_url <BASE_URL> --api_key <YOUR_API_KEY> --model <MODEL_NAME> --output_file predictions_u_math.json
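The script targets any OpenAI-compatible chat-completions endpoint. As a rough sketch, the request body it sends would look something like the following; the placeholder values, the prompt, and the sampling parameters here are illustrative assumptions, not necessarily what solve_u_math.py actually uses:

```python
import json

# Illustrative request body for an OpenAI-compatible /v1/chat/completions
# endpoint. The actual prompt template and parameters in solve_u_math.py
# may differ.
problem_text = "Compute the derivative of x^2 * sin(x)."  # stand-in problem
payload = {
    "model": "<MODEL_NAME>",  # same value as the --model flag
    "messages": [{"role": "user", "content": problem_text}],
    "temperature": 0.0,  # assumed: greedy decoding for reproducibility
}
print(json.dumps(payload, indent=2))
```

This same request shape works against both hosted APIs (with --base_url pointing at the provider) and a locally served model behind vLLM's OpenAI-compatible server.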

Judge U-MATH Solutions

To evaluate the correctness of U-MATH solutions, run the following command:

python judge_u_math.py --base_url <BASE_URL> --api_key <YOUR_API_KEY> --model <MODEL_NAME> --predictions_file predictions_u_math.json --output_file judgments_u_math.json
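Once judgments are written out, an aggregate score can be computed from the output file. A minimal sketch, assuming each judgment record carries a boolean `correct` field; the actual schema of judgments_u_math.json may differ:

```python
import json  # needed if loading the real judgments file

def accuracy(judgments):
    """Fraction of solutions the judge marked correct."""
    if not judgments:
        return 0.0
    return sum(1 for j in judgments if j.get("correct") is True) / len(judgments)

# Stand-in for: judgments = json.load(open("judgments_u_math.json"))
sample = [
    {"problem_id": 1, "correct": True},
    {"problem_id": 2, "correct": False},
    {"problem_id": 3, "correct": True},
    {"problem_id": 4, "correct": True},
]
print(accuracy(sample))  # 0.75
```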

Evaluate Judge on μ-MATH

To evaluate the quality of LLM judgments for μ-MATH solutions, run the following command:

python judge_mu_math.py --base_url <BASE_URL> --api_key <YOUR_API_KEY> --model <MODEL_NAME> --output_file judgments_mu_math.json
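Meta-evaluation compares the judge's verdicts against μ-MATH's gold correctness labels. A hedged sketch of one such comparison using simple agreement; the field names `judge_label` and `gold_label` are assumptions, and the benchmark's official metric may be computed differently:

```python
def agreement(records):
    """Fraction of records where the judge's verdict matches the gold label."""
    if not records:
        return 0.0
    matches = sum(1 for r in records if r["judge_label"] == r["gold_label"])
    return matches / len(records)

# Illustrative records: one judge verdict vs. gold label per LLM solution.
sample = [
    {"judge_label": True, "gold_label": True},
    {"judge_label": False, "gold_label": True},
    {"judge_label": False, "gold_label": False},
    {"judge_label": True, "gold_label": True},
]
print(agreement(sample))  # 0.75
```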

Licensing Information

  • The contents of μ-MATH's machine-generated model_output column are subject to the underlying LLMs' licensing terms.
  • All other U-MATH and μ-MATH dataset fields, as well as the code, are available under the MIT license.

Citation

If you use U-MATH or μ-MATH in your research, please cite the paper:

@inproceedings{umath2024,
  title={U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs},
  author={Chernyshev, Konstantin and Polshkov, Vitaliy and Artemova, Ekaterina and Myasnikov, Alex and Stepanov, Vlad and Miasnikov, Alexei and Tilga, Sergei},
  year={2024}
}

Contact

For inquiries, please contact [email protected]
