-
Notifications
You must be signed in to change notification settings - Fork 22
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #104 from lm-sys/routellm
Add RouteLLM blog post
- Loading branch information
Showing
15 changed files
with
136 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,128 @@ | ||
--- | ||
title: "RouteLLM: An Open-Source Framework for Cost-Effective LLM Routing" | ||
author: "Isaac Ong*, Amjad Almahairi*, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica" | ||
date: "July 1, 2024" | ||
previewImg: /images/blog/routellm/cover.png | ||
--- | ||
|
||
LLMs have demonstrated remarkable capabilities across a range of tasks, but there exists wide variation in their costs and capabilities, as seen from the plot of performance against cost in Figure 1. Very broadly, more capable models tend to be more expensive than less capable models. This leads to a dilemma when deploying LLMs in the real-world - routing all queries to the largest, most capable model leads to the highest-quality responses but can be expensive, while routing queries to smaller models can save costs but may result in lower-quality responses. | ||
|
||
<img src="/images/blog/routellm/main.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 50%"></img> | ||
|
||
<p style="color:gray; text-align: center;">Figure 1: Plot of performance against cost of various LLMs. Performance is measured by Elo on Chatbot Arena, and cost per million tokens assuming a 1:1 input / output ratio. Through routing between two models, we ideally achieve a better performance:cost ratio than can be achieved with either model.</p> | ||
|
||
*LLM routing* offers a solution to this problem, whereby each query is first processed by a system that decides which LLM to route it to. Ideally, the system should route all queries that can be sufficiently handled by weaker models to such models, and all other queries to stronger models, minimizing cost while maintaining response quality. However, this turns out to be a challenging problem because the routing system has to infer both the characteristics of an incoming query and different models’ capabilities before routing. | ||
|
||
To tackle this, we present **RouteLLM**, a principled framework for LLM routing based on preference data. We formalize the problem of LLM routing and explore augmentation techniques to improve router performance. We trained four different routers using public data from Chatbot Arena and demonstrate that they can significantly reduce costs without compromising quality, with **cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K** as compared to using only GPT-4, while still achieving 95% of GPT-4 performance. We also publicly release all our code and datasets, including a new [open-source framework](https://github.com/lm-sys/RouteLLM) for serving and evaluating LLM routers. | ||
|
||
## Routing Setup | ||
|
||
In our routing setup, we focus on the case where there are two models: a stronger, more expensive model, and a weaker but cheaper model. Given this setup, our objective is to minimize costs while achieving high quality by routing between both models. | ||
|
||
<img src="/images/blog/routellm/metrics.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 45%"></img> | ||
|
||
|
||
<p style="color:gray; text-align: center;">Figure 2: Random router performance on MT Bench</p> | ||
|
||
This is best understood through Figure 2, which represents the performance of a router that randomly routes between the two models on MT Bench. Specifically, we route between GPT-4 and Mixtral 8x7B here, with their performance denoted by the red and grey dotted lines respectively. For any router, we can plot a similar graph of its performance against the number of the calls made to GPT-4 (which is representative of the cost incurred since the cost of a Mixtral call is negligible). | ||
|
||
To train our routers, we use *preference data*, which each consists of a prompt and a comparison between the response quality of two models on that prompt i.e. this could be a win for the first model, a win for the second model, or a tie. Using preference data allows us to learn about the strengths and weaknesses of different models and how they relate to queries, which is effective for training routers. For our base dataset, we utilize [public data](https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k) from [Chatbot Arena](http://chat.lmsys.org). We also investigate *data augmentation* techniques to further improve performance using both golden-label datasets and a LLM judge. | ||
|
||
We trained four routers using a mix of Chatbot Arena data and data augmentation: | ||
- A similarity-weighted (SW) ranking router that performs a “weighted Elo calculation” based on similarity | ||
- A matrix factorization model that learns a scoring function for how well a model can answer a prompt | ||
- A BERT classifier that predicts which model can provide a better response | ||
- A causal LLM classifier that also predicts which model can provide a better response | ||
|
||
## Results | ||
|
||
We evaluated these routers on three popular benchmarks: [MT Bench](https://arxiv.org/abs/2306.05685), [MMLU](https://arxiv.org/abs/2009.03300), and [GSM8K](https://arxiv.org/abs/2110.14168), presenting results for MT Bench and MMLU below. For evaluation, we route between `gpt-4-1106-preview` as our strong model and `mixtral-8x7b-instruct-v0.1` as our weak model. We use the random router from before as our baseline. | ||
|
||
<br /> | ||
<figure style="text-align: center"> | ||
<img src="/images/blog/routellm/combined-mt-bench.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 90%"></img> | ||
</figure> | ||
|
||
<p style="color:gray; text-align: center;">Figure 3: Router performance on MT Bench (left) trained only on Arena data (right) trained on Arena data augmented using a LLM judge.</p> | ||
|
||
Figure 3 displays the performance of our routers on MT Bench. For routers trained only on the Arena dataset, we observe strong performance for both matrix factorization and SW ranking. Notably, matrix factorization is able to achieve 95% of GPT-4 performance using 26% GPT-4 calls, which is approximately 48% cheaper as compared to the random baseline. | ||
|
||
Augmenting the Arena data using an LLM judge leads to significant improvements across all routers. When trained on this augmented dataset, matrix factorization is again the best-performing router, with the number of GPT-4 calls required to achieve 95% GPT-4 performance further halved at 14% of total calls, 75% cheaper than the random baseline. | ||
|
||
<br /> | ||
<figure style="text-align: center"> | ||
<img src="/images/blog/routellm/combined-mmlu.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 90%"></img> | ||
</figure> | ||
|
||
|
||
<p style="color:gray; text-align: center;">Figure 4: Router performance on MMLU (left) trained only on Arena data (right) trained on Arena data augmented using golden-label data from the MMLU validation split.</p> | ||
|
||
Conversely, on MMLU in Figure 4, all routers perform poorly at a near-random level when trained only on the Arena dataset, which we attribute to most MMLU questions being out-of-distribution. However, augmenting the training dataset using golden-label data from the MMLU validation split leads to significant performance improvements across all routers, with our best-performing causal LLM router now requiring only 54% GPT-4 calls to achieve 95% of GPT-4 performance, 14% cheaper than the random baseline. Importantly, this augmented dataset of approximately 1500 samples represents less than 2% of the overall training data, demonstrating the effectiveness of data augmentation even when the number of samples is small. | ||
|
||
### Generalizing to Other Models | ||
|
||
While we route between GPT-4 and Mixtral for the above evaluations, to demonstrate the generalizability of our framework, we also present MT Bench results when routing between a different model pair: Claude 3 Opus and Llama 3 8B. Importantly, we use the same routers *without any retraining*, and responses from Claude 3 Opus and Llama 3 8B are not present in our training data. | ||
|
||
<br /> | ||
<img src="/images/blog/routellm/mt-bench-claude-llama.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 45%"></img> | ||
|
||
<p style="color:gray; text-align: center;">Figure 6: Router performance on MT Bench when routed to Claude 3 Opus and Llama 3 8B.</p> | ||
|
||
Even when the model pair is replaced, we observe strong results across all routers on MT Bench in Figure 6, with performance comparable to our original model pair. This suggests that our routers have learned some common characteristics of problems that can distinguish between strong and weak models, which generalize to new model pairs without additional training. | ||
|
||
### RouteLLM vs Commercial Offerings | ||
|
||
<br /> | ||
<figure style="text-align: center"> | ||
<img src="/images/blog/routellm/indep-benchmarks-llama.png" style="display:inline; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 46%"></img> | ||
<img src="/images/blog/routellm/indep-benchmarks.png" style="display:inline; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 45%"></img> | ||
</figure> | ||
|
||
<p style="color:gray; text-align: center;">Figure 7: Comparison of our router against existing routing systems on MT Bench (left) using gpt-4-turbo-2024-04-09 and llama-2-70b-chat (right) using gpt-4-turbo-2024-04-09 and mixtral-8x7b-instruct-v0.1 </p> | ||
|
||
In Figure 7, we also report the performance of our best-performing routers on MT Bench against [Martian](https://withmartian.com/) and [Unify AI](https://unify.ai/), two commercial LLM routing systems. We use `gpt-4-turbo-2024-04-09` as the strong model and `llama-2-70b-chat` or `mixtral-8x7b-instruct-v0.1` as the weak model depending on the models available. Our routers demonstrate very competitive results, achieving the same performance as these commercial routers while being up to 40% cheaper. | ||
|
||
## Conclusion | ||
|
||
These results demonstrate the ability of our routers to achieve significant cost savings while maintaining a high quality of responses. They also highlight the effectiveness of data augmentation in improving routing performance using only a small amount of data, offering a scalable path towards improving routing performance for real-world use cases. | ||
|
||
Based on our learnings from this research, we have created an open-source framework for serving and evaluating routers on [GitHub](https://github.com/lm-sys/RouteLLM). We are also releasing all our routers and datasets on [HuggingFace](https://huggingface.co/routellm) for public use. | ||
|
||
We are excited to see what you build on top of this! Please let us know if you face any issues or have any suggestions. For the full details, please refer to our [arXiv](https://arxiv.org/abs/2406.18665) paper. | ||
|
||
## Acknowledgements | ||
We are grateful to Tyler Griggs for his valuable feedback on this post. | ||
|
||
## Citations | ||
|
||
``` | ||
@misc{ong2024routellmlearningroutellms, | ||
title={RouteLLM: Learning to Route LLMs with Preference Data}, | ||
author={Isaac Ong and Amjad Almahairi and Vincent Wu and Wei-Lin Chiang and Tianhao Wu and Joseph E. Gonzalez and M Waleed Kadous and Ion Stoica}, | ||
year={2024}, | ||
eprint={2406.18665}, | ||
archivePrefix={arXiv}, | ||
primaryClass={cs.LG}, | ||
url={https://arxiv.org/abs/2406.18665}, | ||
} | ||
@misc{chiang2024chatbot, | ||
title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference}, | ||
author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica}, | ||
year={2024}, | ||
eprint={2403.04132}, | ||
archivePrefix={arXiv}, | ||
primaryClass={cs.AI} | ||
} | ||
@misc{ding2024hybridllmcostefficientqualityaware, | ||
title={Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing}, | ||
author={Dujian Ding and Ankur Mallick and Chi Wang and Robert Sim and Subhabrata Mukherjee and Victor Ruhle and Laks V. S. Lakshmanan and Ahmed Hassan Awadallah}, | ||
year={2024}, | ||
eprint={2404.14618}, | ||
archivePrefix={arXiv}, | ||
primaryClass={cs.LG}, | ||
url={https://arxiv.org/abs/2404.14618}, | ||
} | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.