diff --git a/blog/2024-05-08-llama3.md b/blog/2024-05-08-llama3.md
index 48f8b383..fddbf25f 100644
--- a/blog/2024-05-08-llama3.md
+++ b/blog/2024-05-08-llama3.md
@@ -31,7 +31,7 @@ We focus on battles consisting of Llama 3-70b against 5 top-ranked models (claud
**Topic Analysis.** We utilize an LLM labeler (Llama 3-70b) to categorize user prompts into a pre-established taxonomy of topics ([from Reka's paper](https://arxiv.org/pdf/2404.12387)) and visualize the win rate of Llama 3-70b against the other top models in Figure 1. We see that Llama 3’s win rate is highest for open-ended and creative tasks like brainstorming and writing, and lowest for more close-ended technical tasks like math and translation. Interestingly, Llama 3 achieves the highest win rate over data processing tasks which mainly consist of parsing and dataframe operations, but as this category has only 19 examples, this remains inconclusive.

-**win rate VS prompt difficulty.** We employ our [recently released pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/) which scores the difficulty of prompts to determine how Llama 3 compares to the other top models as prompts get harder. We define a set of "hardness" criteria and use GPT-4-turbo to annotate each prompt from 0 to 7 to indicate how many of these criteria are satisfied (a higher score indicates a harder prompt). Our 7 criteria are:
+**Win Rate versus Prompt Difficulty.** We employ our [recently released pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/) which scores the difficulty of prompts to determine how Llama 3 compares to the other top models as prompts get harder. We define a set of "hardness" criteria and use GPT-4-turbo to annotate each prompt from 0 to 7 to indicate how many of these criteria are satisfied (a higher score indicates a harder prompt). Our 7 criteria are:
@@ -60,7 +60,7 @@ We focus on battles consisting of Llama 3-70b against 5 top-ranked models (claud
-We score 1000 battles against the top 3 models on the leaderboard and plot their win rates VS prompt score in Figure 2. We observe a significant drop in Llama 3's performance compared to the other top models, from a high 50% win rate to a low 40% win rate. We conclude that as more of these "hardness" criteria are met, Llama 3's win rate drop rapidly compared to other models. Note that these criteria may not be exhaustive, see [the blog](https://lmsys.org/blog/2024-04-19-arena-hard/) for further discussion.
+We score 1000 battles against the top 3 models on the leaderboard and plot their win rates versus prompt score in Figure 2. We observe a significant drop in Llama 3's performance compared to the other top models, from a high of around 50% win rate to a low of around 40%. We conclude that as more of these "hardness" criteria are met, Llama 3's win rate drops rapidly relative to the other top models'. Note that these criteria may not be exhaustive; see [the blog](https://lmsys.org/blog/2024-04-19-arena-hard/) for further discussion.
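
As a rough illustration of the scoring step described above, the sketch below counts how many "hardness" criteria an LLM judge says a prompt satisfies and then buckets battle outcomes by that score. The `chat` callable, the judge prompt, and the criterion wording are hypothetical placeholders, not the actual Arena-Hard pipeline.

```python
# Hypothetical stand-ins for the seven "hardness" criteria; the real pipeline's
# wording may differ.
CRITERIA = [
    "Specificity", "Domain Knowledge", "Complexity", "Problem-Solving",
    "Creativity", "Technical Accuracy", "Real-World Application",
]

JUDGE_PROMPT = """Which of the following criteria does the user prompt satisfy?
{criteria}

Answer with the names of the satisfied criteria, one per line.

User prompt:
{prompt}
"""


def hardness_score(prompt: str, chat) -> int:
    """Count how many criteria the judge model says are satisfied (0-7).

    `chat` is a hypothetical callable that sends one message to the judge
    model (e.g., GPT-4-Turbo) and returns its text reply.
    """
    reply = chat(JUDGE_PROMPT.format(criteria="\n".join(CRITERIA), prompt=prompt))
    return sum(1 for c in CRITERIA if c.lower() in reply.lower())


def win_rate_by_score(battles, chat):
    """Compute a model's win rate per hardness score.

    `battles` is an iterable of dicts like {"prompt": str, "model_won": bool}.
    """
    tallies = {}  # score -> (wins, total)
    for b in battles:
        score = hardness_score(b["prompt"], chat)
        wins, total = tallies.get(score, (0, 0))
        tallies[score] = (wins + int(b["model_won"]), total + 1)
    return {s: w / n for s, (w, n) in sorted(tallies.items())}
```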

Figure 2. Several top models' win rates against the strongest 6 models, across intervals of the number of key criteria satisfied. *English battles between strongest models: llama-3-70b-chat, claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-api-0409-preview.

@@ -124,7 +124,7 @@ td {text-align: left}
Llama 3-70B-Instruct 12,719 7,591 1.68 1 65
-Claude-3-opus-20240229 68,656 48,570 1.41 1 73
+Claude-3-Opus-20240229 68,656 48,570 1.41 1 73
All Models All Time 749,205 316,372 2.37 1 591
@@ -147,22 +147,22 @@ In order to limit the impact of users that vote many times, we can take the mean
Llama 3-70B-Instruct 0.541 0.543
-Claude-3-opus-20240229 0.619 0.621
+Claude-3-Opus-20240229 0.619 0.621

-**Qualitative differences between Llama 3 outputs VS other models.** From qualitative analysis of outputs between Llama 3 and other models, we observe that Llama 3 outputs are often more excited, positive, conversational, and friendly than other models.
+**Qualitative differences between Llama 3's outputs and those of other models.** From qualitative analysis of outputs between Llama 3 and other models, we observe that Llama 3's outputs are often more excited, positive, conversational, and friendly than those of other models.

**Measuring sentiment.** To measure excitement, we assign a binary label to each output based on the presence of an exclamation point. For positivity, friendliness, and conversationality, we use GPT-3.5 as a judge to rate each output on a scale of 1-5. In a given battle, Llama 3's outputs are labeled as more excited, positive, conversational, or friendly if their score is higher than the opponent's. Figure 5 displays the distribution of these qualities across models, revealing that Llama 3's outputs generally exhibit higher levels of excitement, positivity, friendliness, and conversationality as compared to their opponents.
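
A minimal sketch of how such labels could be produced, assuming a hypothetical `judge` callable that wraps the GPT-3.5 judge; the rating prompt and parsing here are illustrative, not the exact ones used in this analysis:

```python
RATING_PROMPT = (
    "On a scale of 1-5, how {quality} is the following response? "
    "Reply with a single integer.\n\nResponse:\n{response}"
)


def is_excited(text: str) -> bool:
    """Binary excitement label: the output contains an exclamation point."""
    return "!" in text


def rate(text: str, quality: str, judge) -> int:
    """Rate one output 1-5 for positivity, friendliness, or conversationality.

    `judge` is a hypothetical callable that sends one message to GPT-3.5
    and returns its text reply.
    """
    reply = judge(RATING_PROMPT.format(quality=quality, response=text))
    return int(reply.strip().split()[0])


def llama_more_than_opponent(llama_out: str, opp_out: str, judge) -> dict:
    """For one battle, flag the traits on which Llama 3's output scores higher."""
    flags = {"exclamatory": is_excited(llama_out) and not is_excited(opp_out)}
    for quality in ("positive", "friendly", "conversational"):
        flags[quality] = rate(llama_out, quality, judge) > rate(opp_out, quality, judge)
    return flags
```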

-Figure 5: Proportion of arena prompts where Llama 3 is more positive/friendly/conversational/exclamatory than its opponent’s output
+Figure 5: Proportion of arena prompts where Llama 3 is more positive/friendly/conversational/exclamatory than its opponent.

**Is sentiment related to win rate?** Figure 6 compares the sentiment qualities of Llama 3's outputs in battles it wins versus those it loses. We see that all traits appear more in winning battles and less in losing battles, but this difference is relatively small, especially for positivity and friendliness. This suggests that while these traits might play a role in competitive success, their influence requires further exploration for more definitive insights.
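
A short pandas sketch of this comparison, assuming a hypothetical `battles` DataFrame with one boolean column per trait flag (from the step above) and a boolean `llama_won` column:

```python
import pandas as pd

TRAITS = ["exclamatory", "positive", "friendly", "conversational"]


def trait_rates_by_outcome(battles: pd.DataFrame) -> pd.DataFrame:
    """Share of battles where Llama 3 was 'more X' than its opponent, split by win/loss."""
    return battles.groupby("llama_won")[TRAITS].mean()


# Toy usage:
# battles = pd.DataFrame({
#     "llama_won":      [True, True, False],
#     "exclamatory":    [True, False, False],
#     "positive":       [True, True, False],
#     "friendly":       [True, True, True],
#     "conversational": [True, False, False],
# })
# print(trait_rates_by_outcome(battles))
```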

-Figure 6: Llama 3 sentiment VS win rate which llama is more positive/friendly/conversational/exclamatory than its opponent’s output.
+Figure 6: Llama 3's sentiment versus its win rate when Llama 3 is more positive/friendly/conversational/exclamatory than its opponent.

## Conclusion

From the beginning, our mission has been to advance LLM development and understanding. While in the past we have focused on high-level ranking and benchmark design, moving forward, we hope to extend the analysis here and conduct more in-depth analysis into changes in human preference as well as model behavior.