diff --git a/blog/2024-07-01-routellm.md b/blog/2024-07-01-routellm.md
index fe5a125b..d8e7d94a 100644
--- a/blog/2024-07-01-routellm.md
+++ b/blog/2024-07-01-routellm.md
@@ -11,7 +11,7 @@ LLMs have demonstrated remarkable capabilities across a range of tasks, but ther

Figure 1: Plot of performance against cost of various LLMs. Performance is measured by Elo on Chatbot Arena, and cost per million tokens assuming a 1:1 input / output ratio. Through routing between two models, we ideally achieve a better performance:cost ratio than can be achieved with either model.

-*LLM routing* offers a solution to this, where each query is first processed by a system that decides which LLM to route it to. Ideally, all queries that can be handled by weaker models should be routed to these models, with all other queries routed to stronger models, minimizing cost while maintaining response quality. However, this turns out to be a challenging problem because the routing system has to infer both the characteristics of an incoming query and different models’ capabilities when routing.
+LLM routing offers a solution to this, where each query is first processed by a system that decides which LLM to route it to. Ideally, all queries that can be handled by weaker models should be routed to these models, with all other queries routed to stronger models, minimizing cost while maintaining response quality. However, this turns out to be a challenging problem because the routing system has to infer both the characteristics of an incoming query and different models’ capabilities when routing.
 To tackle this, we present **RouteLLM**, a principled framework for LLM routing based on preference data. We formalize the problem of LLM routing and explore augmentation techniques to improve router performance. We trained four different routers using public data from Chatbot Arena and demonstrate that they can significantly reduce costs without compromising quality, with **cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K** as compared to using only GPT-4, while still achieving 95% of GPT-4’s performance. We also publicly release all our code and datasets, including a new [open-source framework](https://github.com/lm-sys/RouteLLM) for serving and evaluating LLM routers.
@@ -26,7 +26,7 @@ In our routing setup, we focus on the case where there are two models: a stronge
 This is best understood through Figure 2, which represents the performance of a router that randomly routes between the two models on MT Bench.
 Specifically, we route between GPT-4 and Mixtral 8x7B here, with their performance denoted by the red and grey dotted lines respectively. For any router, we can plot a similar graph of its performance against the number of the calls made to GPT-4 (which is representative of the cost incurred since the cost of a Mixtral call is negligible).
-To train our routers, we use *preference data*, expanding on [previous work](https://arxiv.org/abs/2404.14618) on routing. Each data point consists of a prompt and a comparison between the response quality of two models on that prompt i.e. this could be a win for the first model, a win for the second model, or a tie. Using preference data allows us to learn about the strengths and weaknesses of different models and how they relate to queries, which is effective for training routers. For our base dataset, we utilize [public data](https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k) from [Chatbot Arena](http://chat.lmsys.org). We also investigate *data augmentation* techniques to further improve performance using both golden-label datasets and a LLM judge.
+We use *preference data* for training our routers, a novel approach as compared to [previous work](https://arxiv.org/abs/2404.14618). Each data point consists of a prompt and a comparison between the response quality of two models on that prompt i.e. this could be a win for the first model, a win for the second model, or a tie. Using preference data allows us to learn about the strengths and weaknesses of different models and how they relate to queries, which is effective for training routers. For our base dataset, we utilize [public data](https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k) from [Chatbot Arena](http://chat.lmsys.org). We also investigate *data augmentation* techniques to further improve performance using both golden-label datasets and a LLM judge.
 We trained four routers using a mix of Chatbot Arena data and data augmentation:
 - A similarity-weighted (SW) ranking router that performs a “weighted Elo calculation” based on similarity
@@ -59,17 +59,6 @@ Augmenting the Arena data using an LLM judge leads to significant improvements a
 Conversely, on MMLU in Figure 4, all routers perform poorly at a near-random level when trained only on the Arena dataset, which we attribute to most MMLU questions being out-of-distribution. However, augmenting the training dataset using golden-label data from the MMLU validation split leads to significant performance improvements across all routers, with our best-performing causal LLM router now requiring only 54% GPT-4 calls to achieve 95% of GPT-4 performance, 14% cheaper than the random baseline. Importantly, this augmented dataset of approximately 1500 samples represents less than 2% of the overall training data, demonstrating the effectiveness of data augmentation even when the number of samples is small.
-### Generalizing to Other Models
-
-While we route between GPT-4 and Mixtral for the above evaluations, to demonstrate the generalizability of our framework, we also present MT Bench results when routing between a different model pair: Claude 3 Opus and Llama 3 8B. Importantly, we use the same routers *without any retraining*, and responses from Claude 3 Opus and Llama 3 8B are not present in our training data.
-
-
-
-
-Figure 6: Router performance on MT Bench when routed to Claude 3 Opus and Llama 3 8B.
-
-Even when the model pair is replaced, we observe strong results across all routers on MT Bench in Figure 6, with performance comparable to our original model pair. This suggests that our routers have learned some common characteristics of problems that can distinguish between strong and weak models, which generalize to new model pairs without additional training.
-
 ### RouteLLM vs Commercial Offerings
@@ -78,13 +67,24 @@ Even when the model pair is replaced, we observe strong results across all route
-Figure 7: Comparison of our router against existing routing systems on MT Bench (left) using gpt-4-turbo-2024-04-09 and llama-2-70b-chat (right) using gpt-4-turbo-2024-04-09 and mixtral-8x7b-instruct-v0.1
+Figure 6: Comparison of our router against existing routing systems on MT Bench (left) using gpt-4-turbo-2024-04-09 and llama-2-70b-chat (right) using gpt-4-turbo-2024-04-09 and mixtral-8x7b-instruct-v0.1
+
+In Figure 6, we also report the performance of our best-performing routers on MT Bench against [Martian](https://withmartian.com/) and [Unify AI](https://unify.ai/), two LLM routing products released by companies. We use the latest GPT-4 Turbo as the strong model and either Llama 2 70B or Mixtral 8x7B as the weak model based on the methodology detailed [here](https://github.com/lm-sys/RouteLLM/tree/main/benchmarks). Our routers demonstrate very strong results, achieving the same performance as these commercial routers while being over 40% cheaper.
+
+### Generalizing to Other Models
+
+While we route between GPT-4 and Mixtral for the above evaluations, to demonstrate the generalizability of our framework, we also present MT Bench results when routing between a different model pair: Claude 3 Opus and Llama 3 8B. Importantly, we use the same routers *without any retraining*, and responses from Claude 3 Opus and Llama 3 8B are not present in our training data.
+
+
+
+
+Figure 7: Router performance on MT Bench when routed to Claude 3 Opus and Llama 3 8B.
-In Figure 7, we also report the performance of our best-performing routers on MT Bench against [Martian](https://withmartian.com/) and [Unify AI](https://unify.ai/), two commercial LLM routing systems. We use `gpt-4-turbo-2024-04-09` as the strong model and `llama-2-70b-chat` or `mixtral-8x7b-instruct-v0.1` as the weak model depending on the models available as detailed [here](https://github.com/lm-sys/RouteLLM/tree/main/benchmarks). Our routers demonstrate very competitive results, achieving the same performance as these commercial routers while being up to 40% cheaper.
+Even when the model pair is replaced, we observe strong results across all routers on MT Bench in Figure 7, with performance comparable to our original model pair. This suggests that our routers have learned some common characteristics of problems that can distinguish between strong and weak models, which generalize to new model pairs without additional training.
 ## Conclusion
-These results demonstrate the ability of our routers to achieve significant cost savings while maintaining a high quality of responses. They also highlight the effectiveness of data augmentation in improving routing performance using only a small amount of data, offering a scalable path towards improving routing performance for real-world use cases.
+These results demonstrate the ability of our routers to achieve significant cost savings while maintaining high-quality responses. They also highlight the effectiveness of data augmentation in improving routing performance using only a small amount of data, offering a scalable path towards improving routing performance for real-world use cases.
 Based on this research, we have created an open-source framework for serving and evaluating routers on [GitHub](https://github.com/lm-sys/RouteLLM). We are also releasing all our routers and datasets on [HuggingFace](https://huggingface.co/routellm) for public use.