Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
{"pageProps":{"frontmatter":{"title":"RedTeam Arena: An Open-Source, Community-driven Jailbreaking Platform","author":"Anastasios Angelopoulos*, Luca Vivona*, Wei-Lin Chiang*, Aryan Vichare, Lisa Dunlap, Salvivona, Pliny, Ion Stoica","date":"Sep 13, 2024","previewImg":"/images/blog/redteam_arena/badwords.png"},"content":"\nWe are excited to launch [RedTeam Arena](https://redarena.ai), a community-driven redteaming platform, built in collaboration with [Pliny](https://x.com/elder_plinius) and the [BASI](https://discord.gg/Y6GxC59G) community!\n\n\n\n<img src=\"/images/blog/redteam_arena/badwords.png\" style=\"display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 80%\"></img>\n<p style=\"color:gray; text-align: center;\">Figure 1: RedTeam Arena with Bad Words at <a href=\"https://redarena.ai\">redarena.ai</a></p>\n\nRedTeam Arena is an [open-source](https://github.com/redteaming-arena/redteam-arena) red-teaming platform for LLMs. Our plan is to provide games that people can play to have fun, while sharpening their red-teaming skills. The first game we created is called *[Bad Words](https://redarena.ai)*, challenging players to convince models to say target \"bad words\". It already has strong community adoption, with thousands of users participating and competing for the top spot on the jailbreaker leaderboard.\n\nWe plan to open the data after a short responsible disclosure delay. We hope this data will help the community determine the boundaries of AI models: how they can be controlled and convinced.\n\nThis is not a bug bounty program, and it is not your grandma’s jailbreak arena. Our goal is to serve and grow the redteaming community, and to make this one of the largest crowdsourced red-teaming initiatives of all time. From our perspective, models that are easily persuaded are not worse: they are just more controllable, and less resistant to persuasion. 
This can be good or bad depending on your use-case; it’s not black-and-white.\n\nWe need your help. Join our jailbreaking game at [redarena.ai](https://redarena.ai). All the code is open-sourced on [GitHub](https://github.com/redteaming-arena/redteam-arena). You can open issues and also send feedback on [Discord](https://discord.gg/mP3PwbKG9m). You are welcome to propose new games, or new bad words on X (just tag @[lmsysorg](https://x.com/lmsysorg) and @[elder_plinius](https://x.com/elder_plinius) so we see it)!\n\n\n## The Leaderboard: Extended Elo\n\n<img src=\"/images/blog/redteam_arena/leaderboard.png\" style=\"display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 100%\"></img>\n<p style=\"color:gray; text-align: center;\">Figure 2. Leaderboard screenshot. Latest version at <a href=\"https://redarena.ai/leaderboard\">redarena.ai/leaderboard</a></p>\n\nPeople have been asking how we compute the leaderboard of players, models, and prompts. The idea is to treat every round of Bad Words as a 1v1 game between a player and a (prompt, model) combination, and calculate the corresponding Elo score. Doing this naively is sample-inefficient and would result in slow convergence, so we instead designed a new statistical method for this purpose (writeup coming!), which we describe below.\n\n*Observation model.* Let $T$ be the number of battles (“time-steps”), $M$ be the number of models, $P$ be the number of players, and $R$ be the number of prompts. 
For each battle $i \\in [T]$, we have a player, a model, and a prompt, encoded as follows:\n\n* $X_i^{\\rm Model} \\in \\{0,1\\}^M$, a one-hot vector with 1 on the entry of the model sampled in battle $i$.\n* $X_i^{\\rm Player} \\in \\{0,1\\}^P$, a one-hot vector with 1 on the entry of the player in battle $i$.\n* $X_i^{\\rm Prompt} \\in \\{0,1\\}^R$, a one-hot vector with 1 on the entry of the prompt sampled in battle $i$.\n* $Y_i \\in \\{0,1\\}$, a binary outcome taking the value 1 if the player won (or forfeited) and 0 otherwise.\n\nWe then model the win probability of the player as\n\\begin{equation}\n\t\\mathbb{P}(Y_i = 1) = \\frac{e^{X_i^{\\rm Player}\\beta^{\\rm Player}}}{e^{X_i^{\\rm Player}\\beta^{\\rm Player}} + e^{X_i^{\\rm Model}\\beta^{\\rm Model} + X_i^{\\rm Prompt}\\beta^{\\rm Prompt}}}.\n\\end{equation}\nThis form might look familiar, since it is the same type of model as the Arena Score: a logistic model. This is just a logistic model with a different, _additive_ structure: the model scores $\\beta^{\\rm Model}$ and prompt scores $\\beta^{\\rm Prompt}$ combine additively to generate a notion of total strength for the model-prompt pair. The player scores $\\beta^{\\rm Player}$ have an interpretation similar to that of the standard Elo score, and we let $\\beta$ denote the concatenation $(\\beta^{\\rm Player}, \\beta^{\\rm Model}, \\beta^{\\rm Prompt})$. For lack of a better term, we call this model “Extended Elo”.\n\nWhat problem is this new model solving that the old Elo algorithm couldn’t? The answer is in the efficiency of estimation. The standard Elo algorithm could apply in our setting by simply calling every model-prompt pair a distinct “opponent” for the purposes of calculating the leaderboard. However, this approach has two issues:\n\n* It cannot disentangle the effectiveness of the prompt from that of the model: there is a single coefficient for the pair. 
Instead, extended Elo can assign _strength to each subpart_.\n* There are $M\\times R$ model-prompt pairs, and only $M+R$ distinct models and prompts. Therefore, asymptotically if $M$ and $R$ grow proportionally, the extended Elo procedure has a quadratic sample-size saving over the standard Elo procedure.\n\n\nNow, we solve this logistic regression problem _online_. That is, letting $\\ell(x,y;\\beta)$ be the binary cross-entropy loss, we use the iteration\n\\begin{equation}\n \\beta_n = \\beta_{n-1} - \\nabla_\\beta \\ell(X_{n-1}, Y_{n-1}; \\beta_{n-1}).\n\\end{equation}\nThis is a generalization of the Elo update. In fact, if one removes the prompt coefficient, it reduces exactly to the Elo update between players and models, as if these were 1v1 games.\n\nThat’s it! After updating the model coefficients in this way, we report them in the leaderboard tables on [RedTeam Arena](https://redarena.ai/leaderboard). We also have more plans for this approach: extended Elo can be used not just for 1v2 leaderboards, like this one, but for any $N$v$M$-player leaderboard, attributing a notion of strength to each subpart using binary human preference feedback.\n\n\n## What’s next?\n\n[RedTeam Arena](https://redarena.ai) is a community-driven project, and we’re eager to grow it further with your help! Whether through raising GitHub issues, creating PRs [here](https://github.com/redteaming-arena/redteam-arena), or providing feedback on [Discord](https://discord.gg/mP3PwbKG9m), we welcome all your contributions!\n","slug":"2024-09-13-redteam-arena"},"__N_SSG":true} | ||
{"pageProps":{"frontmatter":{"title":"RedTeam Arena: An Open-Source, Community-driven Jailbreaking Platform","author":"Anastasios Angelopoulos*, Luca Vivona*, Wei-Lin Chiang*, Aryan Vichare, Lisa Dunlap, Salvivona, Pliny, Ion Stoica","date":"Sep 13, 2024","previewImg":"/images/blog/redteam_arena/badwords.png"},"content":"\nWe are excited to launch [RedTeam Arena](https://redarena.ai), a community-driven redteaming platform, built in collaboration with [Pliny](https://x.com/elder_plinius) and the [BASI](https://discord.gg/Y6GxC59G) community!\n\n\n\n<img src=\"/images/blog/redteam_arena/badwords.png\" style=\"display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 80%\"></img>\n<p style=\"color:gray; text-align: center;\">Figure 1: RedTeam Arena with Bad Words at <a href=\"https://redarena.ai\">redarena.ai</a></p>\n\nRedTeam Arena is an [open-source](https://github.com/redteaming-arena/redteam-arena) red-teaming platform for LLMs. Our plan is to provide games that people can play to have fun, while sharpening their red-teaming skills. The first game we created is called *[Bad Words](https://redarena.ai)*, challenging players to convince models to say target \"bad words\". It already has strong community adoption, with thousands of users participating and competing for the top spot on the jailbreaker leaderboard.\n\nWe plan to open the data after a short responsible disclosure delay. We hope this data will help the community determine the boundaries of AI models: how they can be controlled and convinced.\n\nThis is not a bug bounty program, and it is not your grandma’s jailbreak arena. Our goal is to serve and grow the redteaming community, and to make this one of the largest crowdsourced red-teaming initiatives of all time. From our perspective, models that are easily persuaded are not worse: they are just more controllable, and less resistant to persuasion. 
This can be good or bad depending on your use-case; it’s not black-and-white.\n\nWe need your help. Join our jailbreaking game at [redarena.ai](https://redarena.ai). All the code is open-sourced on [GitHub](https://github.com/redteaming-arena/redteam-arena). You can open issues and also send feedback on [Discord](https://discord.gg/mP3PwbKG9m). You are welcome to propose new games, or new bad words on X (just tag @[lmsysorg](https://x.com/lmsysorg) and @[elder_plinius](https://x.com/elder_plinius) so we see it)!\n\n\n## The Leaderboard: Extended Elo\n\n<img src=\"/images/blog/redteam_arena/leaderboard.png\" style=\"display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 100%\"></img>\n<p style=\"color:gray; text-align: center;\">Figure 2. Leaderboard screenshot. Latest version at <a href=\"https://redarena.ai/leaderboard\">redarena.ai/leaderboard</a></p>\n\nPeople have been asking how we compute the leaderboard of players, models, and prompts. The idea is to treat every round of Bad Words as a 1v1 game between a player and a (prompt, model) combination, and calculate the corresponding Elo score. Doing this naively is sample-inefficient and would result in slow convergence, so we instead designed a new statistical method for this purpose (writeup coming!), which we describe below.\n\n*Observation model.* Let $T$ be the number of battles (“time-steps”), $M$ be the number of models, $P$ be the number of players, and $R$ be the number of prompts. 
For each battle $i \\in [T]$, we have a player, a model, and a prompt, encoded as follows:\n\n* $X_i^{\\rm Model} \\in \\{0,1\\}^M$, a one-hot vector with 1 on the entry of the model sampled in battle $i$.\n* $X_i^{\\rm Player} \\in \\{0,1\\}^P$, a one-hot vector with 1 on the entry of the player in battle $i$.\n* $X_i^{\\rm Prompt} \\in \\{0,1\\}^R$, a one-hot vector with 1 on the entry of the prompt sampled in battle $i$.\n* $Y_i \\in \\{0,1\\}$, a binary outcome taking the value 1 if the player won (or forfeited) and 0 otherwise.\n\nWe then model the win probability of the player as\n\\begin{equation}\n\t\\mathbb{P}(Y_i = 1 | X_i^{\\rm Model}, X_i^{\\rm Player}, X_i^{\\rm Prompt}) = \\frac{e^{X_i^{\\rm Player}\\beta^{\\rm Player}}}{e^{X_i^{\\rm Player}\\beta^{\\rm Player}} + e^{X_i^{\\rm Model}\\beta^{\\rm Model} + X_i^{\\rm Prompt}\\beta^{\\rm Prompt}}}.\n\\end{equation}\nThis form might look familiar, since it is the same type of model as the Arena Score: a logistic model. This is just a logistic model with a different, _additive_ structure: the model scores $\\beta^{\\rm Model}$ and prompt scores $\\beta^{\\rm Prompt}$ combine additively to generate a notion of total strength for the model-prompt pair. The player scores $\\beta^{\\rm Player}$ have an interpretation similar to that of the standard Elo score, and we let $\\beta$ denote the concatenation $(\\beta^{\\rm Player}, \\beta^{\\rm Model}, \\beta^{\\rm Prompt})$. For lack of a better term, we call this model “Extended Elo”.\n\nWhat problem is this new model solving that the old Elo algorithm couldn’t? The answer is in the efficiency of estimation. The standard Elo algorithm could apply in our setting by simply calling every model-prompt pair a distinct “opponent” for the purposes of calculating the leaderboard. However, this approach has two issues:\n\n* It cannot disentangle the effectiveness of the prompt from that of the model: there is a single coefficient for the pair. 
Instead, extended Elo can assign _strength to each subpart_.\n* There are $M\\times R$ model-prompt pairs, and only $M+R$ distinct models and prompts. Therefore, asymptotically if $M$ and $R$ grow proportionally, the extended Elo procedure has a quadratic sample-size saving over the standard Elo procedure.\n\n\nNow, we solve this logistic regression problem _online_. That is, letting $\\ell(x,y;\\beta)$ be the binary cross-entropy loss, we use the iteration\n\\begin{equation}\n \\beta_n = \\beta_{n-1} - \\eta \\nabla_\\beta \\ell(X_{n-1}, Y_{n-1}; \\beta_{n-1}),\n\\end{equation}\nfor some learning rate $\\eta$.\nThis is a generalization of the Elo update. In fact, if one removes the prompt coefficient, it reduces exactly to the Elo update between players and models, as if these were 1v1 games.\n\nThat’s it! After updating the model coefficients in this way, we report them in the leaderboard tables on [RedTeam Arena](https://redarena.ai/leaderboard). We also have more plans for this approach: extended Elo can be used not just for 1v2 leaderboards, like this one, but for any $N$v$M$-player leaderboard, attributing a notion of strength to each subpart using binary human preference feedback.\n\n\n## What’s next?\n\n[RedTeam Arena](https://redarena.ai) is a community-driven project, and we’re eager to grow it further with your help! Whether through raising GitHub issues, creating PRs [here](https://github.com/redteaming-arena/redteam-arena), or providing feedback on [Discord](https://discord.gg/mP3PwbKG9m), we welcome all your contributions!\n","slug":"2024-09-13-redteam-arena"},"__N_SSG":true} |
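For readers of this diff, here is a minimal sketch of the Extended Elo update described in the post content above. It is not taken from the RedTeam Arena codebase. It assumes scores on the natural-log scale (no 400-point base-10 Elo scaling), scores initialized to 0, and a hypothetical learning rate `eta`; the function names and the example ids (`"alice"`, `"model-a"`, `"word-1"`) are illustrative only. With one-hot inputs, the gradient of the binary cross-entropy loss touches only the three coefficients involved in the battle, so each update moves the player score by `eta * (y - p)` and the model and prompt scores by the opposite amount.

```python
import math
from collections import defaultdict


def win_probability(beta_player, beta_model, beta_prompt):
    """P(Y = 1): e^a / (e^a + e^b) with a = player score, b = model + prompt score."""
    z = beta_player - (beta_model + beta_prompt)
    return 1.0 / (1.0 + math.exp(-z))


def extended_elo_update(scores, player, model, prompt, y, eta=0.1):
    """One online gradient step on the binary cross-entropy loss.

    `scores` maps "player" / "model" / "prompt" to dicts of coefficients.
    `y` is 1 if the player won the battle and 0 otherwise.
    Returns the win probability predicted before the update.
    """
    p = win_probability(
        scores["player"][player],
        scores["model"][model],
        scores["prompt"][prompt],
    )
    err = y - p  # negative gradient of the loss with respect to the player score
    scores["player"][player] += eta * err
    scores["model"][model] -= eta * err
    scores["prompt"][prompt] -= eta * err
    return p


if __name__ == "__main__":
    # Hypothetical ids; every score starts at 0 (an assumption of this sketch).
    scores = {
        "player": defaultdict(float),
        "model": defaultdict(float),
        "prompt": defaultdict(float),
    }
    p = extended_elo_update(scores, "alice", "model-a", "word-1", y=1)
    print(p)  # 0.5 on the first battle, since all scores start at 0
```

Dropping the prompt term from `win_probability` and from the update leaves the familiar two-sided Elo step between a player and a model, which matches the reduction claimed in the post.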
This file was deleted.