
Commit: Release MT-bench code (lm-sys#1722)

Co-authored-by: Wei-Lin Chiang <[email protected]>
Co-authored-by: Ying Sheng <[email protected]>

3 people authored Jun 16, 2023
1 parent 0c65af1 commit b494d0c
Showing 16 changed files with 2,019 additions and 1 deletion.
Binary file added assets/qa_browser.png
1 change: 1 addition & 0 deletions fastchat/constants.py
@@ -1,6 +1,7 @@
from enum import IntEnum
import os

REPO_PATH = os.path.dirname(os.path.dirname(__file__))

##### For the gradio web server
SERVER_ERROR_MSG = (
178 changes: 178 additions & 0 deletions fastchat/llm_judge/README.md
@@ -0,0 +1,178 @@
# LLM Judge
| [Paper](https://arxiv.org/abs/2306.05685) |

In this package, you can use MT-bench questions and prompts to evaluate your models with LLM-as-a-judge.

## Contents
- [Review Pre-Generated Model Answers and Judgments](#review-pre-generated-model-answers-and-judgments)
- [MT-Bench](#mt-bench)
- [Release Plan](#release-plan)

## Installation
```
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e .
```

## Review Pre-Generated Model Answers and Judgments
The model answers and LLM judgments used in the paper are available on Google Drive.
You can download them and open a gradio demo to review them.

- Download the data:
```
cd FastChat/fastchat/llm_judge
pip3 install gdown
gdown --fuzzy https://drive.google.com/file/d/1LNOc7NAc7BXM1LMhRlorsrMu38G9yoHT/view?usp=sharing
tar xzf llm_judge_repo_data.tar.gz
```
- Open a gradio demo website for browsing the questions, answers, and judgments.
```
python qa_browser.py --share
```

A screenshot:
<img src="../../assets/qa_browser.png" width="90%">

## MT-Bench

### How to evaluate a model on MT-bench?

#### Step 1. Generate model answers to MT-bench questions
```
python gen_model_answer.py --model-path [MODEL-PATH] --model-id [MODEL-ID]
```

Note: `[MODEL-PATH]` can be any huggingface model path.
e.g.,
```
python gen_model_answer.py --model-path lmsys/fastchat-t5-3b-v1.0 --model-id fastchat-t5-3b-v1.0
```
The answers will be saved to `data/mt_bench/model_answer/[MODEL-ID].jsonl`.

You can also specify `--num-gpus-per-model` for model parallelism (needed for large 65B models) and `--num-gpus-total` to parallelize answer generation with multiple GPUs.
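
For a quick sanity check of the output, here is a minimal sketch that loads the generated answer file; the field names (`question_id`, `model_id`) are assumptions about the JSONL schema, not a documented contract:
```
import json

# Minimal sketch: load and inspect the generated answers.
# The field names ("question_id", "model_id") are assumptions about
# the JSONL schema; verify them against the actual output file.
path = "data/mt_bench/model_answer/fastchat-t5-3b-v1.0.jsonl"
with open(path) as f:
    answers = [json.loads(line) for line in f]

print(len(answers), "answers")        # expect one per MT-bench question
print(answers[0].get("question_id"))
print(answers[0].get("model_id"))
```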

#### Step 2. Run GPT-4 judge with pairwise comparison against a baseline (default: gpt-3.5-turbo)
```
python gen_judgment.py --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
```

e.g.,
```
> python gen_judgment.py --model-list vicuna-13b-v1.2 alpaca-13b gpt-3.5-turbo --parallel 2
Stats:
{
    "bench": "mt_bench",
    "mode": "pairwise-baseline",
    "judge": "gpt-4",
    "baseline": "gpt-3.5-turbo",
    "model_list": [
        "vicuna-13b-v1.2",
        "alpaca-13b",
        "gpt-3.5-turbo"
    ],
    "total_num_questions": 80,
    "total_num_matches": 320,
    "output_path": "data/mt_bench/model_judgment/gpt-4_pair.jsonl"
}
Press Enter to confirm...
```

The judgments will be saved to `data/mt_bench/model_judgment/gpt-4_pair.jsonl`.
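
To inspect these judgments programmatically, a rough sketch follows; `model` and `winner` are assumed field names, so check a record of the actual file before relying on this layout:
```
import json
from collections import Counter

# Rough sketch: tally pairwise outcomes per model.
# "model" and "winner" are assumed field names; inspect a record of
# the actual JSONL before relying on this layout.
outcomes = Counter()
with open("data/mt_bench/model_judgment/gpt-4_pair.jsonl") as f:
    for line in f:
        record = json.loads(line)
        outcomes[(record.get("model"), record.get("winner"))] += 1

for (model, winner), count in sorted(outcomes.items(), key=str):
    print(model, winner, count)
```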

#### Step 3. Show win-rate
```
> python show_result.py
Input file: data/mt_bench/model_judgment/gpt-4_pair.jsonl
                 win  loss  tie  win_rate  loss_rate
model
gpt-4            107     9   44   0.66875    0.05625
claude-v1         64    23   73   0.40000    0.14375
vicuna-13b-v1.2   21    72   67   0.13125    0.45000
alpaca-13b         5   129   26   0.03125    0.80625
llama-13b          1   139   20   0.00625    0.86875
```
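
Each rate in this table is simply the corresponding count divided by the model's total number of matches; a quick check against the gpt-4 row:
```
# win_rate = win / (win + loss + tie), computed per model
win, loss, tie = 107, 9, 44       # gpt-4 row above
total = win + loss + tie          # 160 matches
print(win / total, loss / total)  # 0.66875 0.05625, matching the table
```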

### Other grading options
Besides pairwise comparison against a fixed baseline model, we also support two additional grading options:
- `single`: do single-answer grading without pairwise comparison.
- `pairwise-all`: run pairwise comparisons between all model pairs on all questions.

#### Option 2: Single-answer grading

Another, more scalable option is to let GPT-4 grade each answer individually and assign it a score, without any pairwise comparison.

- Generate GPT-4 judgments
```
python gen_judgment.py --mode single --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
Stats:
{
    "bench": "mt_bench",
    "mode": "single",
    "judge": "gpt-4",
    "baseline": null,
    "model_list": [
        "vicuna-13b-v1.2",
        "llama-13b",
        "alpaca-13b",
        "gpt-3.5-turbo",
        "gpt-4",
        "claude-v1"
    ],
    "total_num_questions": 80,
    "total_num_matches": 960,
    "output_path": "data/mt_bench/model_judgment/gpt-4_single.jsonl"
}
```
The judgments will be saved to `data/mt_bench/model_judgment/gpt-4_single.jsonl`.
- Show the MT-bench score
```
> python show_result.py --mode single
                    score
model
gpt-4            8.937500
gpt-3.5-turbo    7.925000
claude-v1        7.503125
vicuna-13b-v1.2  6.156250
alpaca-13b       4.918750
llama-13b        3.190625
```
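
The score shown here is presumably the mean of GPT-4's per-answer grades. A hedged sketch of that aggregation, with `model` and `score` as assumed field names:
```
import json
from collections import defaultdict

# Sketch: average the per-answer grades for each model.
# "model" and "score" are assumed field names in the judgment JSONL.
scores = defaultdict(list)
with open("data/mt_bench/model_judgment/gpt-4_single.jsonl") as f:
    for line in f:
        record = json.loads(line)
        scores[record["model"]].append(record["score"])

for model, vals in sorted(scores.items(),
                          key=lambda kv: sum(kv[1]) / len(kv[1]),
                          reverse=True):
    print(f"{model:18s} {sum(vals) / len(vals):.6f}")
```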

#### Option 3: Run GPT-4 judge with all-pair comparisons

Another option is to run pairwise comparisons between all model pairs.
This becomes more expensive as the number of models grows, but it provides more comprehensive information.
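
To make the cost concrete, the number of judge calls grows with the number of model pairs. A back-of-envelope sketch, assuming one judgment per model pair per question turn (MT-bench has 80 questions; two turns per question is an assumption consistent with the match counts shown above):
```
from math import comb

# Back-of-envelope cost of pairwise-all judging.
# Assumption: one GPT-4 judgment per model pair per question turn.
num_models = 6
num_questions = 80
num_turns = 2
matches = comb(num_models, 2) * num_questions * num_turns
print(matches)  # 2400 judgments for 6 models, vs. 960 for single-answer grading
```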

```
> python gen_judgment.py --mode pairwise-all --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
```

```
> python show_result.py --mode pairwise-all
Input file: data/mt_bench/model_judgment/gpt-4_pair.jsonl
                 win  loss  tie  win_rate  loss_rate
model
gpt-4            617    45  138   0.77125    0.05625
claude-v1        445   115  240   0.55625    0.14375
gpt-3.5-turbo    372   198  230   0.46500    0.24750
vicuna-13b-v1.2  242   310  248   0.30250    0.38750
alpaca-13b       104   515  181   0.13000    0.64375
llama-13b         20   617  163   0.02500    0.77125
```

### How to get GPT-3.5/GPT-4/Claude's answers?
- Run `python gen_api_answer.py --model [MODEL-NAME]` to generate answers from GPT-3.5/GPT-4 and Claude.

## Release Plan
Our first release contains:
- The MT-bench questions in [data/mt_bench/question.jsonl](data/mt_bench/question.jsonl); a loading sketch follows this list.
- The model answers and GPT-4 judgments, available on Google Drive.
- The judge prompts in [data/judge_prompts.jsonl](data/judge_prompts.jsonl).
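
For orientation, a minimal sketch of loading the question file; the field names (`question_id`, `category`, `turns`) are assumptions about the schema, not a documented contract:
```
import json

# Sketch: load the MT-bench questions.
# Field names ("question_id", "category", "turns") are assumed;
# check the actual file before depending on them.
with open("data/mt_bench/question.jsonl") as f:
    questions = [json.loads(line) for line in f]

print(len(questions), "questions")   # 80, per the stats shown earlier
print(questions[0].get("category"))
print(questions[0].get("turns"))     # the multi-turn prompts
```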

The next release will include:
- All data
  - 3K expert votes
  - 30K arena conversations with human votes
- All code
  - computing agreement between judges
  - others
