diff --git a/blog/2024-07-01-routellm.md b/blog/2024-07-01-routellm.md
index 656a2867..11f5d780 100644
--- a/blog/2024-07-01-routellm.md
+++ b/blog/2024-07-01-routellm.md
@@ -30,7 +30,7 @@ In our routing setup, we focus on the case where there are two models: a stronge
 
 This is best understood through Figure 2, which represents the performance of a router that randomly routes between the two models on MT Bench. Specifically, we route between GPT-4 and Mixtral 8x7B here, with their performance denoted by the red and grey dotted lines respectively. For any router, we can plot a similar graph of its performance against the number of the calls made to GPT-4 (which is representative of the cost incurred since the cost of a Mixtral call is negligible).
 
-We use *preference data* for training our routers, a novel approach as compared to [previous work](https://arxiv.org/abs/2404.14618). Each data point consists of a prompt and a comparison between the response quality of two models on that prompt i.e. this could be a win for the first model, a win for the second model, or a tie. Using preference data allows us to learn about the strengths and weaknesses of different models and how they relate to queries, which is effective for training routers. For our base dataset, we utilize [public data](https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k) from [Chatbot Arena](http://chat.lmsys.org). We also investigate *data augmentation* techniques to further improve performance using both golden-label datasets and a LLM judge.
+We use *preference data* for training our routers, building upon previous works ([1](https://arxiv.org/abs/2404.14618),[2](https://huyenchip.com/2024/02/28/predictive-human-preference.html)). Each data point consists of a prompt and a comparison between the response quality of two models on that prompt i.e. this could be a win for the first model, a win for the second model, or a tie. Using preference data allows us to learn about the strengths and weaknesses of different models and how they relate to queries, which is effective for training routers. For our base dataset, we utilize [public data](https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k) from [Chatbot Arena](http://chat.lmsys.org). We also investigate *data augmentation* techniques to further improve performance using both golden-label datasets and a LLM judge.
 
 We trained four routers using a mix of Chatbot Arena data and data augmentation:
 - A similarity-weighted (SW) ranking router that performs a “weighted Elo calculation” based on similarity
@@ -40,7 +40,8 @@ We trained four routers using a mix of Chatbot Arena data and data augmentation:
 
 ## Results
 
-We evaluated these routers on three popular benchmarks: [MT Bench](https://arxiv.org/abs/2306.05685), [MMLU](https://arxiv.org/abs/2009.03300), and [GSM8K](https://arxiv.org/abs/2110.14168), presenting results for MT Bench and MMLU below. For evaluation, we route between `gpt-4-1106-preview` as our strong model and `mixtral-8x7b-instruct-v0.1` as our weak model. We use the random router from before as our baseline.
+We evaluated these routers on three popular benchmarks: [MT Bench](https://arxiv.org/abs/2306.05685), [MMLU](https://arxiv.org/abs/2009.03300), and [GSM8K](https://arxiv.org/abs/2110.14168), presenting results for MT Bench and MMLU below. For evaluation, we route between GPT-4 Turbo as our strong model and Mixtral 8x7B as our weak model. We use the random router from before as our baseline.
+