From afb81eba4d982ceb3841f28144f74507ee8b65f3 Mon Sep 17 00:00:00 2001
From: Lisa Dunlap
Date: Thu, 27 Jun 2024 20:10:36 -0700
Subject: [PATCH] bug fixes

---
 blog/2024-06-27-multimodal.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/blog/2024-06-27-multimodal.md b/blog/2024-06-27-multimodal.md
index db55cc63..1f6c71a5 100644
--- a/blog/2024-06-27-multimodal.md
+++ b/blog/2024-06-27-multimodal.md
@@ -121,13 +121,13 @@ td {text-align: left}
-Table 1. Multimodal Arena Leaderboard (Timeframe: June 10th - June 22, 2023). Total votes = 12,827. The latest and detailed version here.
+Table 1. Multimodal Arena Leaderboard (Timeframe: June 10th - June 25th, 2024). Total votes = 12,827. The latest and detailed version here.
 Rank Model Arena Score 95% CI Votes
-1 GPT-4o 1226 +7/-7 3878
+1 GPT-4o 1226 +7/-7 3878
 2 Claude 3.5 Sonnet 1209 +5/-6 5664
 3 Gemini 1.5 Pro 1171 +10/-6 3851
@@ -136,7 +136,7 @@ td {text-align: left}
-3 GPT-4 Turbo 1167 +10/-9 3385
+3 GPT-4 Turbo 1167 +10/-9 3385
 5 Claude 3 Opus 1084 +8/-7 3988
@@ -170,7 +170,7 @@ multimodal leaderboard ranking aligns closely with the LLM leaderboard, but with
 As a small note, you might also notice that the “Elo rating” column from earlier Arena leaderboards has been renamed to “Arena score.” Rest assured: nothing has changed in the way we compute this quantity; we just renamed it. (The reason for the change is that we were computing the Bradley-Terry coefficients, which are slightly different from the Elo score, and wanted to avoid future confusion.) You should think of the Arena score as a measure of *model strength*. If model A has an Arena score $s_A$ and model B has an arena score $s_B$, you can calculate the win rate of model A over model B as
-$$\mathbb{P}A (\text{ beats } B) = \frac{1}{1 + e^{\frac{s_B - s_A}{400}}},$$
+$$\mathbb{P}(A \text{ beats } B) = \frac{1}{1 + e^{\frac{s_B - s_A}{400}}},$$
 where the number 400 is an arbitrary scaling factor that we chose in order to display the Arena score in a more human-readable format (as whole numbers). For additional information on how the leaderboard is computed, please see [this notebook](https://colab.research.google.com/drive/1eNPrurghAWlNB1H5uyW244hoVpsvWInc?usp=sharing ).
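
Reviewer note: to sanity-check the corrected win-rate formula, here is a minimal Python sketch that evaluates it as written in the patched line (base $e$, scaling factor 400). The function name `win_probability` and the constant `ARENA_SCALE` are hypothetical names chosen for illustration and are not taken from the linked notebook. The two scores are the GPT-4o and Claude 3.5 Sonnet values from Table 1.

```python
import math

# Scaling factor quoted in the post; the name ARENA_SCALE is an assumption.
ARENA_SCALE = 400


def win_probability(score_a: float, score_b: float, scale: float = ARENA_SCALE) -> float:
    """Return P(A beats B) = 1 / (1 + exp((s_B - s_A) / scale))."""
    return 1.0 / (1.0 + math.exp((score_b - score_a) / scale))


# Arena scores copied from Table 1 in this patch.
gpt4o_score = 1226
claude35_score = 1209

p = win_probability(gpt4o_score, claude35_score)
print(f"P(GPT-4o beats Claude 3.5 Sonnet) = {p:.3f}")
```

With a 17-point gap the result is only slightly above one half, and swapping the arguments gives the complementary probability, which is the behavior the formula should have.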