---
title: "LMSYS Chatbot Arena: Live and Community-Driven LLM Evaluation"
author: "LMSYS Arena Team"
date: "Mar 1, 2024"
previewImg: /images/blog/arena_policy/winrate_heatmap.png
---

## Our Mission

Chatbot Arena ([chat.lmsys.org](https://chat.lmsys.org)) is an open-source project developed by members from LMSYS and UC Berkeley [SkyLab](https://sky.cs.berkeley.edu/). Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We run an evaluation platform where any user can rate LLMs via pairwise comparisons under real-world use cases, and we publish the [leaderboard](https://chat.lmsys.org/?leaderboard) periodically.


## Our Policy

**Open source**: The platform ([FastChat](https://github.com/lm-sys/FastChat)), including the UI frontend, model serving backend, and evaluation tools, is fully open source on GitHub. This means that anyone can clone, audit, or run another instance of Chatbot Arena to produce a similar leaderboard.

**Transparent**: The evaluation process, including rating computation, identification of anomalous users, and model selection, is made publicly available so that others can reproduce our analysis and fully understand how the data is collected. Furthermore, we will involve the community in deciding any changes to the evaluation process.
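
To make the rating computation concrete, here is a minimal sketch of how pairwise vote data can be turned into ratings with an Elo-style update; the actual pipeline lives in the open-source FastChat repository, and the function and field names below are illustrative assumptions, not the production code.

```python
# Illustrative sketch (not the production pipeline): compute Elo-style ratings
# from pairwise battle records. Field names ("model_a", "model_b", "winner")
# are assumptions for this example.
from collections import defaultdict

def compute_elo(battles, k=4, base=10, scale=400, init=1000):
    """battles: iterable of dicts with keys model_a, model_b, and
    winner ("model_a", "model_b", or "tie")."""
    ratings = defaultdict(lambda: init)
    for b in battles:
        ra, rb = ratings[b["model_a"]], ratings[b["model_b"]]
        # Expected score of model_a under the logistic (Elo) model.
        ea = 1 / (1 + base ** ((rb - ra) / scale))
        if b["winner"] == "model_a":
            sa = 1.0
        elif b["winner"] == "model_b":
            sa = 0.0
        else:  # tie
            sa = 0.5
        ratings[b["model_a"]] += k * (sa - ea)
        ratings[b["model_b"]] += k * (ea - sa)
    return dict(ratings)

battles = [
    {"model_a": "model-x", "model_b": "model-y", "winner": "model_a"},
    {"model_a": "model-y", "model_b": "model-x", "winner": "tie"},
]
print(compute_elo(battles))
```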

**Listing models on the leaderboard**: The leaderboard will only include models that are accessible to third parties. In particular, the leaderboard will only include models that (1) have open weights and/or (2) are publicly available through APIs (e.g., gpt-4-0613, gemini-pro) or services (e.g., Bard, GPT-4+browsing).

Once a model is on the leaderboard, it will remain accessible at [chat.lmsys.org](https://chat.lmsys.org) for at least **two weeks** so that the community can evaluate it.

Before a model is published on the leaderboard, we need to accumulate enough votes to compute its rating. During this period, we host the model in blind-test mode; we call this the initial-rating phase. If the model provider decides to pull out and not show the model on the leaderboard, we will allow it, but we may still share with the community the data generated by the model during the initial-rating phase under the "anonymous" label.

To ensure the leaderboard correctly reflects model rankings over time, we rely on live comparisons between models. We may retire models that are no longer online from the leaderboard after a certain period of time.


**Sharing data with the community**: We will periodically share data with the community. In particular, we will periodically share 20% of the arena vote data we have collected, including the prompts, the answers, the identity of the model providing each answer (if the model is or has been on the leaderboard), and the votes. For models we have collected votes for but that have never been on the leaderboard, we will still release the data, but we will label the model as "anonymous".
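
For illustration only, a single shared vote record might look roughly like the sketch below; the field names are assumptions for this example and do not define the official release schema.

```python
# Hypothetical shape of one shared vote record (illustrative field names,
# not the official release schema).
example_record = {
    "prompt": "Explain the difference between TCP and UDP.",
    "model_a": "anonymous",   # label used if the model never joined the leaderboard
    "model_b": "gpt-4-0613",
    "answer_a": "...",
    "answer_b": "...",
    "winner": "model_b",      # or "model_a" / "tie"
}
```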

**Sharing data with the model providers**: Upon request, we will offer early data access to model providers who wish to improve their models. However, this data will be a subset of the data that we periodically share with the community. In particular, we will share with a model provider the data that includes their model's answers. For battles, we will not identify the opponent; we will label the opponent as "anonymous". This data will later be shared with the community during the periodic releases. If the model is not on the leaderboard at the time of sharing, the model's answers will be labeled as "anonymous".

## FAQ

### Why another eval?
Most LLM benchmarks are static, which makes them prone to contamination, as LLMs are trained on most of the data available on the Internet. Chatbot Arena aims to alleviate this problem by providing live evaluation with a continuous stream of new prompts from real people. We also believe that the open nature of the platform will attract users who accurately reflect the broader set of LLM users and real use cases.

### Which models do we evaluate? Why not all?
We will continuously add new models and retire old ones. It is not feasible to add every possible model due to the cost and scalability limits of our evaluation process, i.e., it might take too long to accumulate enough votes to accurately rate each model. Today, the decision to add new models is rather ad hoc: we add models based on the community's perceived interest. We intend to formalize this process in the near future.

### Why should the community trust our eval?
We seek to provide full transparency: the platform and all the tools we use are open source. We invite the community to use our platform and tools to statistically reproduce our results.

### Why do you only share 20% of the data, not all of it?
Arena's mission is to ensure trustworthy evaluation. We periodically share data to mitigate the potential risk of overfitting to certain user distributions or preference biases in the Arena. We will actively review this policy based on the community's feedback.

### Who will fund this effort? Any conflict of interests?
Chatbot Arena is funded solely by gifts, in the form of money, cloud credits, or API credits. These gifts have no strings attached.

## Any feedback?
Feel free to send us an email or leave feedback on [GitHub](https://github.com/lm-sys/FastChat/issues)!