Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent bd33eb3, commit 1e2eda1
Showing 111 changed files with 79 additions and 77 deletions.
File renamed without changes.
1 change: 1 addition & 0 deletions
_next/data/BG3jBQgYP8OEpEw9jp0zu/blog/2024-05-17-category-hard.json
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
{"pageProps":{"frontmatter":{"title":"Introducing Hard Prompts Category in Chatbot Arena","author":"Tianle Li, Wei-Lin Chiang","date":"May 17, 2024","previewImg":"/images/blog/category_hard/preview.png"},"content":"\n### Background\n\nWe introduce **Hard Prompts**, a new and challenging category in the Chatbot Arena [Leaderboard](https://leaderboard.lmsys.org).\n\nOver the past few months, we have heard growing interest from the community in seeing more challenging prompts that push the limits of current language models. To address this demand, we are excited to introduce the **Hard Prompts** category, which features Arena's user prompts that are specifically designed to be more complex, demanding, and rigorous. These prompts are carefully curated to test the capabilities of the latest language models and to explore the boundaries of what they can achieve. We believe this new category can provide valuable insights into the strengths and weaknesses of different models in more challenging tasks.\n\n### New Category: Hard Prompts!\n\nTo evaluate the difficulty of a prompt, we define several hardness criteria, such as domain knowledge, complexity, and problem-solving. A prompt that satisfies multiple criteria is considered to be more challenging and is assigned a higher hardness score. We then use these scores to create a new leaderboard category, **Hard Prompts**.\n\nIn Figure 1, we present the ranking shift from English to Hard Prompts (English). We observe that **Llama-3-8B-Instruct**, which performs on par with **GPT-4-0314** on the English leaderboard, drops significantly in ranking. This suggests that the model may struggle with the increased complexity and difficulty of the prompts in this new specialized category. 
Similarly, we observe that **Claude-3-Opus** now ranks above **Llama-3-70B-Instruct**, and that **GPT-4o** shows a slight improvement.\n\n<img src=\"/images/blog/category_hard/elo_comparison_1.png\" style=\"display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 85%\"></img>\n<p style=\"color:gray; text-align: center;\">Figure 1. Comparison between Chatbot Arena Category English vs Hard Prompts (English). We set gpt-4-0314 as the anchor model.</p>\n\nWe also observe notable improvements in **GPT-3.5-Turbo-1106/0125** and **Claude-2.1**, as well as **Phi-3**, which is trained to perform well in reasoning tasks. \n\n<img src=\"/images/blog/category_hard/elo_comparison_2.png\" style=\"display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 85%\"></img>\n<p style=\"color:gray; text-align: center;\">Figure 2. Comparison between Chatbot Arena Category English vs Hard Prompts (English). We set mixtral-8x7b-instruct-v0.1 as the anchor model.</p>\n\n\n### How to Define Hard Prompts?\n\nA few weeks ago, we introduced the [Arena-Hard](https://lmsys.org/blog/2024-04-19-arena-hard/) pipeline to identify a collection of high-quality prompts from Chatbot Arena. Each user prompt is evaluated against the 7 key criteria defined in the table below.\n\n<table style=\"width:100%; border-collapse: collapse; border: 1px solid black;\">\n <tr style=\"background-color: black; color: white;\">\n <!-- <th style=\"border: 1px solid black; padding: 10px; text-align: left;\">7 Key \"Hardness\" Criteria</th> -->\n </tr>\n <tr>\n <td style=\"border: 1px solid black; padding: 10px; text-align: left;\"><strong>1. Specificity:</strong> Does the prompt ask for a specific output?</td>\n </tr>\n <tr>\n <td style=\"border: 1px solid black; padding: 10px; text-align: left;\"><strong>2. 
Domain Knowledge:</strong> Does the prompt cover one or more specific domains?</td>\n </tr>\n <tr>\n <td style=\"border: 1px solid black; padding: 10px; text-align: left;\"><strong>3. Complexity:</strong> Does the prompt have multiple levels of reasoning, components, or variables?</td>\n </tr>\n <tr>\n <td style=\"border: 1px solid black; padding: 10px; text-align: left;\"><strong>4. Problem-Solving:</strong> Does the prompt directly require the AI to demonstrate active problem-solving skills?</td>\n </tr>\n <tr>\n <td style=\"border: 1px solid black; padding: 10px; text-align: left;\"><strong>5. Creativity:</strong> Does the prompt involve a level of creativity in approaching the problem?</td>\n </tr>\n <tr>\n <td style=\"border: 1px solid black; padding: 10px; text-align: left;\"><strong>6. Technical Accuracy:</strong> Does the prompt require technical accuracy in the response?</td>\n </tr>\n <tr>\n <td style=\"border: 1px solid black; padding: 10px; text-align: left;\"><strong>7. Real-world Application:</strong> Does the prompt relate to real-world applications?</td>\n </tr>\n</table>\n\nWe employ Meta's **Llama-3-70B-Instruct** as the judge model to help us label over 1 million Arena battles. Figure 3 shows the criteria breakdown (i.e., how many prompts satisfy each criterion). We observe that the most common criteria are Specificity, Domain Knowledge, and Real-world Application, while the relatively rare criteria are Problem-Solving and Complexity.\n\n<img src=\"/images/blog/category_hard/key_criteria_breakdown.png\" style=\"display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 85%\"></img>\n<p style=\"color:gray; text-align: center;\">Figure 3. The percentage of prompts satisfying each criterion within 1 million Chatbot Arena battles.</p>\n\nWe then calculate each prompt's Hardness Score as the number of criteria it satisfies and present the distribution in Figure 4. Interestingly, we find that ~20% of prompts have a score of 6 or higher. 
You can find several examples of what a hard prompt actually looks like in the [Example Section](#example).\n\n\n<img src=\"/images/blog/category_hard/hardness_breakdown.png\" style=\"display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 85%\"></img>\n<p style=\"color:gray; text-align: center;\">Figure 4. The percentage of prompts with different hardness scores within 1 million Chatbot Arena battles.</p>\n\n\nWe then take the prompts with a score of 6 or higher (those that satisfy 6 or more of these criteria) to be the \"Hard Prompts\" category and calculate two leaderboards, **Hard Prompts (English)** and **Hard Prompts (Overall)**.\n\nBelow is a screenshot of the leaderboard for the **Hard Prompts (English)** category (as of May 17, 2024). You can find the latest version at [https://leaderboard.lmsys.org](https://leaderboard.lmsys.org) (-> Category dropdown).\n\n<img src=\"/images/blog/category_hard/leaderboard.png\" style=\"display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 95%\"></img>\n<p style=\"color:gray; text-align: center;\">Figure 5. The leaderboard for Hard Prompts (English) category as of May 17, 2024.</p>\n\n\nWe hope to continually enhance the Chatbot Arena leaderboard and its analysis. We look forward to seeing how the latest advancements in language models perform on these challenging prompts, and to sharing these insights with the broader community.\n\n## Citation\n```\n@misc{arenacategoryhard2024,\n title = {Introducing Hard Prompts Category in Chatbot Arena},\n url = {https://lmsys.org/blog/2024-05-17-category-hard/},\n author = {Tianle Li, Wei-Lin Chiang},\n month = {May},\n year = {2024}\n}\n```\n\n## Example\nWe present 10 examples of user prompts with increasing hardness scores. 
The labeled criteria are inside the bracket.\n\n**Prompt 1:**\n\n[None]\n\nhello\n\n\n**Prompt 2:**\n\n[Real World]\n\nwhat is cake\n\n\n**Prompt 3:**\n\n[Creativity, Real World]\n\nHow to pickup a girl?\n\n\n**Prompt 4:**\n\n[Specificity, Creativity, Real World]\n\nwriten ten different sentences that end with word \"apple\"\n\n\n**Prompt 5:**\n\n[Specificity, Creativity, Real World]\n\nWriting prompt: write the start of a short story / a man with an iphone is transported back to 1930s USA. \n\n\n**Prompt 6:** \n\n[Specificity, Domain Knowledge, Complexity, Problem-solving, Technical Accuracy, Real World]\n\ntell me how to make a hydroponic nutrient solution at home to grow lettuce with precise amount of each nutrient\n\n\n**Prompt 7:** \n\n[Specificity, Domain Knowledge, Complexity, Problem-solving, Technical Accuracy, Real World]\n\nSolve the integral $\\int_{-\\infty}^{+\\infty} exp(-x^2) dx $ step-by-step with detailed explanation\n\n\n**Prompt 8:** \n\n[Specificity, Domain Knowledge, Complexity, Problem-solving, Technical Accuracy, Real World]\n\nwrite me GLSL code which can gennrate at least 5 colors and 2 waves of particles cross each other\t\n\n\n**Prompt 9:**\n\n[Specificity, Domain Knowledge, Complexity, Problem-solving, Technical Accuracy, Real World]\n\nMy situation is this: I’m setting up a server running at home Ubuntu to run an email server and a few other online services. As we all know, for my email to work reliably and not get blocked I need to have an unchanging public IP address. Due to my circumstances I am not able to get a static IP address through my ISP or change ISPs at the moment.\n\nThe solution I have found is to buy a 4G SIM card with a static IP (from an ISP that offers that), which I can then use with a USB dongle. However this 4G connection costs me substantially per MB to use.\n\nBut. Mail is the only server that needs a static IP address. 
For everything else using my home network connection and updating my DNS records with DDNS would be fine. I have tested this setup previously for other services and it has worked.\n\nSo. I was wondering. Would it in theory be possible to: connect the server to two network interfaces at the same time and route traffic depending on destination port. I.e. all outgoing connections to ports 25, 465, 587, and possibly 993 should be sent through the 4G dongle interface (enx344b50000000) and all other connections sent over eth0. Similarly, the server should listen for incoming connections on the same ports on enx344b50000000 and listen on all other ports (if allowed by ufw) on eth0.\n\nI would then need DNS records from mail.mydomain.tld —> <4g static public IP> and mydomain.tld —> <home public IP> (updated with DDNS, and NAT configured on my home router).\n\nComputers on the internet would then be able to seamlessly connect to these two IP addresses, not “realising” that they are in fact the same machine, as long as requests to mail.mydomain.tld are always on the above mentioned ports.\n\nQuestion: Is this possible? Could it be a robust solution that works the way I hope? Would someone be able to help me set it up?\n\nI have come across a few different guides in my DuckDuckGo-ing, I understand it has to do with setting a mark in iptables and assigning them to a table using ip route. However I haven't managed to get it to work yet, and many of these guides are for VPNs and they all seem to be slightly different to each other. So I thought I would ask about my own specific use case\n\n\n**Prompt 10:** \n\n[Specificity, Domain Knowledge, Complexity, Problem-solving, Creativity, Technical Accuracy, Real World]\n\nWrite me a python script for the foobar problem, but make it so that if read aloud, each pair of lines rhymes. (i.e. lines 1/2 rhyme, 3/4 rhyme and so on)","slug":"2024-05-17-category-hard"},"__N_SSG":true} |
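
Reviewer note: the post added in this file describes the hardness score as the number of criteria a prompt satisfies, with prompts scoring 6 or higher forming the Hard Prompts category. A minimal sketch of that counting and filtering step follows. It is an illustration only, not the Arena pipeline: the names `LabeledPrompt`, `hardness_score`, and `is_hard` are hypothetical, and the labels are assumed to arrive already assigned by the judge model.

```python
from dataclasses import dataclass

# Criterion names follow the table in the post; everything else here
# (class and function names) is hypothetical and for illustration only.
CRITERIA = frozenset({
    "Specificity",
    "Domain Knowledge",
    "Complexity",
    "Problem-Solving",
    "Creativity",
    "Technical Accuracy",
    "Real-world Application",
})

# The post takes prompts that satisfy 6 or more criteria as "hard".
HARD_THRESHOLD = 6


@dataclass(frozen=True)
class LabeledPrompt:
    text: str
    labels: frozenset  # criteria the judge model marked as satisfied


def hardness_score(prompt: LabeledPrompt) -> int:
    """Count how many of the seven known criteria the prompt satisfies."""
    return len(prompt.labels & CRITERIA)


def is_hard(prompt: LabeledPrompt, threshold: int = HARD_THRESHOLD) -> bool:
    """Return True when the score reaches the threshold."""
    return hardness_score(prompt) >= threshold


# Two prompts modelled on the examples in the post (Prompt 1 and Prompt 10).
greeting = LabeledPrompt("hello", frozenset())
rhyming_script = LabeledPrompt("Write me a python script ...", CRITERIA)

hard_prompts = [p for p in (greeting, rhyming_script) if is_hard(p)]
```

Intersecting with `CRITERIA` means an unrecognized label from the judge does not inflate the score, which is a design choice of this sketch and not something the post specifies.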