# Mistral Evals

This repository contains code to run evals released by Mistral AI, as well as standardized prompts, parsing, and metrics computation for popular academic benchmarks.

## Installation

```sh
pip install -r requirements.txt
```

## Evals

We support the following evals in this repository:

- `mm_mt_bench`: MM-MT-Bench is a multi-turn LLM-as-a-judge evaluation task released by Mistral AI that uses GPT-4o for judging model answers given reference answers.
- `vqav2`: VQAv2
- `docvqa`: DocVQA
- `mathvista`: MathVista
- `mmmu`: MMMU
- `chartqa`: ChartQA

## Example usage

### Step 1: Host a model using vLLM

To install vLLM, follow the directions in the vLLM installation documentation.

```sh
>> vllm serve mistralai/Pixtral-12B-2409 --config_format mistral --tokenizer_mode "mistral"
```
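
Once the server is up, a quick sanity check (assuming vLLM's OpenAI-compatible API is listening on the default port 8000) is to list the served models:

```sh
>> curl http://0.0.0.0:8000/v1/models
```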

### Step 2: Evaluate the hosted model

```sh
>> python -m eval.run eval_vllm \
        --model_name mistralai/Pixtral-12B-2409 \
        --url http://0.0.0.0:8000 \
        --output_dir ~/tmp \
        --eval_name "mm_mt_bench"
```

**NOTE:** Evaluating MM-MT-Bench requires calls to GPT-4o as a judge, hence you'll need to set the `OPENAI_API_KEY` environment variable for the eval to work.
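
For example, you can export the key in your shell before launching the eval (the value below is a placeholder for your own key):

```sh
>> export OPENAI_API_KEY="<your-openai-api-key>"
```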

To run one of the other supported evals, pass its name from the Evals section above via `--eval_name`.
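
For instance, assuming the same hosted Pixtral model as above, running DocVQA only requires changing the eval name:

```sh
>> python -m eval.run eval_vllm \
        --model_name mistralai/Pixtral-12B-2409 \
        --url http://0.0.0.0:8000 \
        --output_dir ~/tmp \
        --eval_name "docvqa"
```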

## Evaluating a non-vLLM model

To evaluate your own model, you can also create a `Model` class that implements a `__call__` method, which takes a chat completion request as input and returns a string answer. Requests are provided in the vLLM API format.

```python
from typing import Any


class CustomModel(Model):

    def __call__(self, request: dict[str, Any]) -> str:
        # `request` is a chat completion request in vLLM API format.
        # Run your model on it and return its answer as a string.
        ...
        return answer
```
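
As a minimal illustration only, the sketch below assumes the request follows the OpenAI/vLLM chat-completion layout (a `messages` list of role/content dicts, where multimodal content may be a list of typed parts); the class name and its internals are hypothetical and not part of this repository:

```python
from typing import Any


class LastMessageEchoModel(Model):
    """Toy model that answers with the text of the last user message."""

    def __call__(self, request: dict[str, Any]) -> str:
        # Find the most recent user turn in the chat completion request.
        last_user = next(
            m for m in reversed(request["messages"]) if m["role"] == "user"
        )
        content = last_user["content"]
        # Content may be a plain string, or a list of typed parts
        # (e.g. {"type": "text", ...} and {"type": "image_url", ...}).
        if isinstance(content, str):
            return content
        return " ".join(
            part.get("text", "") for part in content if part.get("type") == "text"
        )
```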