More fixes for Llama 3 blog (#84)
* More fixes for Llama 3 blog

* More fixes
iojw authored May 9, 2024
1 parent 9caec75 commit 390b9f9
Showing 1 changed file with 7 additions and 7 deletions.
blog/2024-05-08-llama3.md
@@ -31,7 +31,7 @@ We focus on battles consisting of Llama 3-70b against 5 top-ranked models (claud

**Topic Analysis.** We utilize an LLM labeler (Llama 3-70b) to categorize user prompts into a pre-established taxonomy of topics ([from Reka's paper](https://arxiv.org/pdf/2404.12387)) and visualize the win rate of Llama 3-70b against the other top models in Figure 1. We see that Llama 3’s win rate is highest for open-ended and creative tasks like brainstorming and writing, and lowest for more close-ended technical tasks like math and translation. Interestingly, Llama 3 achieves the highest win rate over data processing tasks which mainly consist of parsing and dataframe operations, but as this category has only 19 examples, this remains inconclusive.
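
To make the topic breakdown concrete, here is a minimal sketch of how per-topic win rates like those in Figure 1 can be computed once battles have been labeled; the `battles` frame, its column names, and the toy rows are illustrative assumptions, not the blog's actual data or code.

```python
import pandas as pd

# Toy battle records: one row per battle involving Llama 3-70b, already tagged
# with a topic by the LLM labeler (column names are illustrative).
battles = pd.DataFrame({
    "topic":  ["brainstorming", "math", "writing", "math", "translation"],
    "winner": ["llama-3-70b", "gpt-4-turbo", "llama-3-70b", "claude-3-opus", "gpt-4-turbo"],
})

# A battle counts as a Llama 3 win if the voter picked Llama 3-70b.
battles["llama_win"] = battles["winner"].eq("llama-3-70b")

# Win rate per topic = Llama 3 wins / battles within each topic category.
by_topic = battles.groupby("topic")["llama_win"].agg(win_rate="mean", n_battles="size")
print(by_topic)
```

Small categories (such as the 19 data-processing examples noted above) give noisy per-topic estimates, which is why the blog flags that result as inconclusive.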

- **win rate VS prompt difficulty.** We employ our [recently released pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/) which scores the difficulty of prompts to determine how Llama 3 compares to the other top models as prompts get harder. We define a set of "hardness" criteria and use GPT-4-turbo to annotate each prompt from 0 to 7 to indicate how many of these criteria are satisfied (a higher score indicates a harder prompt). Our 7 criteria are:
+ **Win Rate versus Prompt Difficulty.** We employ our [recently released pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/) which scores the difficulty of prompts to determine how Llama 3 compares to the other top models as prompts get harder. We define a set of "hardness" criteria and use GPT-4-turbo to annotate each prompt from 0 to 7 to indicate how many of these criteria are satisfied (a higher score indicates a harder prompt). Our 7 criteria are:

<table style="width:100%; border-collapse: collapse; border: 1px solid black;">
<tr style="background-color: black; color: white;">
@@ -60,7 +60,7 @@ We focus on battles consisting of Llama 3-70b against 5 top-ranked models (claud
</tr>
</table>
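
As a rough illustration of the scoring step described above, the sketch below counts how many criteria a prompt satisfies; `ask_judge` is a hypothetical stand-in for the GPT-4-turbo annotation call, and the criteria list is left to be filled in from the table.

```python
from typing import List

# The 7 "hardness" criteria from the table above (fill in the exact wording).
CRITERIA: List[str] = []

def ask_judge(prompt: str, criterion: str) -> bool:
    """Hypothetical helper: ask the LLM judge (GPT-4-turbo in the blog)
    whether `prompt` satisfies `criterion`, returning True/False."""
    raise NotImplementedError

def hardness_score(prompt: str) -> int:
    # A prompt's score is the number of criteria the judge says it satisfies (0-7).
    return sum(ask_judge(prompt, criterion) for criterion in CRITERIA)
```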

- We score 1000 battles against the top 3 models on the leaderboard and plot their win rates VS prompt score in Figure 2. We observe a significant drop in Llama 3's performance compared to the other top models, from a high 50% win rate to a low 40% win rate. We conclude that as more of these "hardness" criteria are met, Llama 3's win rate drop rapidly compared to other models. Note that these criteria may not be exhaustive, see [the blog](https://lmsys.org/blog/2024-04-19-arena-hard/) for further discussion.
+ We score 1000 battles against the top 3 models on the leaderboard and plot their win rates versus prompt score in Figure 2. We observe a significant drop in Llama 3's performance compared to the other top models, from a high 50% win rate to a low 40% win rate. We conclude that as more of these "hardness" criteria are met, Llama 3's win rate drops rapidly compared to other models. Note that these criteria may not be exhaustive; see [the blog](https://lmsys.org/blog/2024-04-19-arena-hard/) for further discussion.
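
A minimal sketch of the Figure 2-style computation, assuming each battle already carries a 0-7 hardness score; the dataframe, column names, and score intervals below are illustrative, not the blog's actual binning.

```python
import pandas as pd

# Toy battles, each annotated with a 0-7 hardness score and an outcome
# for the model of interest (columns and values are illustrative).
battles = pd.DataFrame({
    "model":    ["llama-3-70b", "llama-3-70b", "gpt-4-turbo", "gpt-4-turbo"],
    "hardness": [1, 6, 1, 6],
    "won":      [True, False, True, True],
})

# Bucket the hardness scores into intervals, then compute each model's win rate per bucket.
buckets = pd.cut(battles["hardness"], bins=[-1, 1, 3, 5, 7], labels=["0-1", "2-3", "4-5", "6-7"])
print(battles.groupby(["model", buckets], observed=True)["won"].mean())
```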

<img src="/images/blog/llama3/winrate-over-criteria.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 70%"></img>
<p style="color:gray; text-align: center;">Figure 2. Several top models' win rate against the strongest 6 models over the intervals of number of key criteria satisfied. *English battles between strongest models: llama-3-70b-chat, claude-3-opus-20240229, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo-2024-04-09, gemini-1.5-pro-api-0409-preview.</p>
@@ -124,7 +124,7 @@ td {text-align: left}
<td>Llama 3-70B-Instruct</td> <td>12,719</td> <td>7,591</td> <td>1.68</td> <td>1</td> <td>65</td>
</tr>
<tr>
- <td>Claude-3-opus-20240229</td> <td>68,656</td> <td>48,570</td> <td>1.41</td> <td>1</td> <td>73</td>
+ <td>Claude-3-Opus-20240229</td> <td>68,656</td> <td>48,570</td> <td>1.41</td> <td>1</td> <td>73</td>
</tr>
<tr>
<td>All Models All Time</td> <td>749,205</td> <td>316,372</td> <td>2.37</td> <td>1</td> <td>591</td>
@@ -147,22 +147,22 @@ In order to limit the impact of users that vote many times, we can take the mean
<td>Llama 3-70B-Instruct</td> <td>0.541</td> <td>0.543</td>
</tr>
<tr>
- <td>Claude-3-opus-20240229</td> <td>0.619</td> <td>0.621</td>
+ <td>Claude-3-Opus-20240229</td> <td>0.619</td> <td>0.621</td>
</tr>
</tbody>
</table>
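
The per-user correction referenced in the hunk above can be reproduced, in spirit, by comparing a plain win rate with the mean of per-user win rates, so heavy voters count only once; the sketch below uses toy data and assumed column names.

```python
import pandas as pd

# Toy votes: one row per battle, with the voter's user id and whether the model won.
battles = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u3"],
    "won":     [True, True, True, False, True],
})

overall = battles["won"].mean()                       # every vote counts equally
per_user = battles.groupby("user_id")["won"].mean()   # win rate within each user
mean_per_user = per_user.mean()                       # every user counts equally

print(f"win rate: {overall:.3f}, win rate per user: {mean_per_user:.3f}")
```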

- **Qualitative differences between Llama 3 outputs VS other models.** From qualitative analysis of outputs between Llama 3 and other models, we observe that Llama 3 outputs are often more excited, positive, conversational, and friendly than other models.
+ **Qualitative differences between Llama 3 outputs versus other models.** From qualitative analysis of outputs between Llama 3 and other models, we observe that Llama 3 outputs are often more excited, positive, conversational, and friendly than other models.

**Measuring sentiment.** To measure excitement, we assign a binary label to each output based on the presence of an exclamation point. For positivity, friendliness, and conversationality, we use GPT-3.5 as a judge to rate each output on a scale of 1-5. In a given battle, Llama 3's outputs are labeled as more excited, positive, conversational, or friendly if their score is higher than the opponent's. Figure 5 displays the distribution of these qualities across models, revealing that Llama 3's outputs generally exhibit higher levels of excitement, positivity, friendliness, and conversationality as compared to their opponents.
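
A minimal sketch of this labeling setup; only the exclamation-point check is concrete, while `rate_output` is a hypothetical placeholder for the GPT-3.5 judge call rather than the blog's actual prompt or API code.

```python
def is_excited(output: str) -> bool:
    # Binary "excitement" label: does the output contain an exclamation point?
    return "!" in output

def rate_output(output: str, quality: str) -> int:
    """Hypothetical placeholder for the GPT-3.5 judge call: rate `output`
    for `quality` (positivity, friendliness, conversationality) on a 1-5 scale."""
    raise NotImplementedError

def llama_is_more(quality: str, llama_output: str, opponent_output: str) -> bool:
    # Within a battle, Llama 3 is labeled "more <quality>" if its score beats the opponent's.
    return rate_output(llama_output, quality) > rate_output(opponent_output, quality)
```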

<img src="/images/blog/llama3/llama_sentiment_distribution.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 85%"></img>
<p style="color:gray; text-align: center;">Figure 5: Proportion of arena prompts where Llama 3 is more positive/friendly/conversational/exclamatory than its opponent’s output</p>
<p style="color:gray; text-align: center;">Figure 5: Proportion of arena prompts where Llama 3 is more positive/friendly/conversational/exclamatory than its opponent.</p>

**Is sentiment related to win rate?** Figure 6 compares the sentiment qualities of Llama 3's outputs in battles it wins versus those it loses. We see that all traits appear more in winning battles and less in losing battles, but this difference is relatively small, especially for positivity and friendliness. This suggests that while these traits might play a role in competitive success, their influence requires further exploration for more definitive insights.
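
To illustrate the Figure 6 comparison, the sketch below splits battles by outcome and reports how often Llama 3 was the more positive output; the dataframe and its columns are illustrative, and the other traits would be handled the same way.

```python
import pandas as pd

# Toy per-battle labels: whether Llama 3 won, and whether it was judged the
# more positive of the two outputs.
battles = pd.DataFrame({
    "llama_won":     [True, True, False, False, True],
    "more_positive": [True, False, False, True, True],
})

# Share of battles exhibiting the trait, among losses (False) and wins (True).
print(battles.groupby("llama_won")["more_positive"].mean())
```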

<img src="/images/blog/llama3/sentiment_win_rate.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 85%"></img>
<p style="color:gray; text-align: center;">Figure 6: Llama 3 sentiment VS win rate which llama is more positive/friendly/conversational/exclamatory than its opponent’s output.</p>
<p style="color:gray; text-align: center;">Figure 6: Llama 3's sentiment versus its win rate when Llama 3 is more positive/friendly/conversational/exclamatory than its opponent.</p>

## Conclusion
From the beginning, our mission has been to advance LLM development and understanding. While in the past we have focused on high-level ranking and benchmark design, moving forward, we hope to extend the analysis here and conduct more in-depth analysis into changes in human preference as well as model behavior.