Commit

Merge pull request #108 from lm-sys/routellm-updates
Add link for RouteLLM
iojw authored Jul 1, 2024
2 parents c52865d + 3696a0f commit 1bf16df
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions blog/2024-07-01-routellm.md
@@ -30,7 +30,7 @@ In our routing setup, we focus on the case where there are two models: a stronge

This is best understood through Figure 2, which represents the performance of a router that randomly routes between the two models on MT Bench. Specifically, we route between GPT-4 and Mixtral 8x7B here, with their performance denoted by the red and grey dotted lines respectively. For any router, we can plot a similar graph of its performance against the number of the calls made to GPT-4 (which is representative of the cost incurred since the cost of a Mixtral call is negligible).

-We use *preference data* for training our routers, a novel approach as compared to [previous work](https://arxiv.org/abs/2404.14618). Each data point consists of a prompt and a comparison between the response quality of two models on that prompt i.e. this could be a win for the first model, a win for the second model, or a tie. Using preference data allows us to learn about the strengths and weaknesses of different models and how they relate to queries, which is effective for training routers. For our base dataset, we utilize [public data](https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k) from [Chatbot Arena](http://chat.lmsys.org). We also investigate *data augmentation* techniques to further improve performance using both golden-label datasets and a LLM judge.
+We use *preference data* for training our routers, building upon previous works ([1](https://arxiv.org/abs/2404.14618),[2](https://huyenchip.com/2024/02/28/predictive-human-preference.html)). Each data point consists of a prompt and a comparison between the response quality of two models on that prompt i.e. this could be a win for the first model, a win for the second model, or a tie. Using preference data allows us to learn about the strengths and weaknesses of different models and how they relate to queries, which is effective for training routers. For our base dataset, we utilize [public data](https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k) from [Chatbot Arena](http://chat.lmsys.org). We also investigate *data augmentation* techniques to further improve performance using both golden-label datasets and a LLM judge.

We trained four routers using a mix of Chatbot Arena data and data augmentation:
- A similarity-weighted (SW) ranking router that performs a “weighted Elo calculation” based on similarity
@@ -40,7 +40,8 @@ We trained four routers using a mix of Chatbot Arena data and data augmentation:

## Results

-We evaluated these routers on three popular benchmarks: [MT Bench](https://arxiv.org/abs/2306.05685), [MMLU](https://arxiv.org/abs/2009.03300), and [GSM8K](https://arxiv.org/abs/2110.14168), presenting results for MT Bench and MMLU below. For evaluation, we route between `gpt-4-1106-preview` as our strong model and `mixtral-8x7b-instruct-v0.1` as our weak model. We use the random router from before as our baseline.
+We evaluated these routers on three popular benchmarks: [MT Bench](https://arxiv.org/abs/2306.05685), [MMLU](https://arxiv.org/abs/2009.03300), and [GSM8K](https://arxiv.org/abs/2110.14168), presenting results for MT Bench and MMLU below. For evaluation, we route between GPT-4 Turbo as our strong model and Mixtral 8x7B as our weak model. We use the random router from before as our baseline.


<br />
<figure style="text-align: center">
