Potential length-controlled metric for Alpaca Eval 2.0 #225
I like the idea of a length-controlled metric! I worry, though, that the unweighted average of the long win rate and the short win rate might be too skewed for models that generate either mostly long or mostly short sequences. What do you think? How about weighting each win/loss by the evaluator's long-short preference?
cc @YannDubs. I know Yann has also been giving this some thought, so I'll let him chime in. Personally, I'd stay away from multiplying by/referencing any "fixed" number like that fixed probability p or any specific model length (besides gpt-4-turbo), but I agree that we need to control for the skewness of models that are overly long or short.
I did some work on this. Will publish soon. Stay tuned.
Thanks for opening this thread and for all the ideas 💯. Been very busy, but I have some work and thoughts that I'll share this weekend!
Here's the notebook where I set up & discuss the metrics/properties I think are important when deciding which length-correction to apply. I analyzed different options, including:
The good thing is that they all improve over the current win-rate on all the metrics I'm considering (gameability by asking for verbosity in the prompt, correlation with Arena, correlation with length). On those metrics, (1) and (3) above perform best. I prefer (1) because the interpretation & range of the metric stay the same as in the current AE. The downside is that it will not work well in cases where we have very few examples that are longer or shorter than the baseline. This setting is relatively rare: e.g., if we threshold to have at least 40 (5%) of examples in each category, that would only filter out 10 models out of 110. I encourage everyone to check the notebook and run it on new metrics you might be thinking about (e.g., @gblazex, I'm not sure which one you are currently considering). My plan is to add a length-corrected metric next weekend, so please let me know your thoughts before then!
Hey @YannDubs, GREAT WORK! Happy to contribute. I'm running your analysis to validate my results, but the GH repo is missing some model 'results' folders, I believe:
Trying to optimize the length bias to approach the same magnitude that humans have is a brilliant idea, much better than simply minimizing it. But similarly, we have no human basis for assuming that the scoring difference between verbose/normal/concise versions of the same model should simply be minimized. It's important not to confuse lack of value with under-appreciated value. Possible next steps:
In the meantime, I updated the ELO scores in the repo to have up-to-date & complete correlations.
Thanks @gblazex, I added the missing annotations to GH. I agree with your point on verbose/normal/concise, i.e., we should be careful when minimizing, since we don't know how humans are impacted by that. That being said:
A somewhat "compressed" update on what I'm working on / looking at.

The avg-length-adjustment method shoots the adjustment factor up as responses get shorter and shorter. This might not be what we want, even if it seemingly improves the newly defined "gameability" metric. It might fairly improve the under-appreciated concise responses of stronger models (GPT-4, Mixtral, ...), but it'll also unfairly boost very short responses from less capable models. Davinci001 is the best example: it gets boosted from a 2.76% win rate to a 12.5% win rate, jumping from 98th place all the way up to 27th place. On the right you can see how a "naive" avg-length adjustment would over-boost Davinci001. This doesn't seem to pass the smell test to me. What I'm experimenting with is dampening this by either:
The blue line is the "naive" avg-length adjustment factor that gets applied to the win rate. The reason this diminished boosting makes sense to me is that we can be less and less sure whether the lower win rates of shorter and shorter responses are due to bad judgement or a genuine lack of quality. (E.g., Davinci001's avg_length is 296! That's simply not enough data to confidently boost it from a 2.76% win rate to a 12.5% win rate.) When looking at the top of the leaderboard, the avg-length-adjustment (or logistic-avg-length-adjustment) method passes the smell test better for me than the balanced-win-rate one, so I'm definitely still advocating for the avg-length-adjustment (with a logistic twist). It also echoes people's sentiment on Twitter. I'm currently looking at other metrics to quantify this. Finally, here is an earlier comparison I ran between
Caveats:
But the overall look feels good to me. Both adjustments reduce the dependence on length by a fair bit. The logistic (rightmost) one doesn't even let very short responses shoot above the trendline, where we lack the confidence to justify more boosting.
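To make the "logistic twist" above concrete, here is a minimal sketch of the two adjustment factors being contrasted. All constants are illustrative assumptions (including the 2049-character baseline length), not the values used in the notebook:

```python
import math

BASELINE_LEN = 2049  # assumed avg. length (chars) of the GPT-4 Turbo baseline; illustrative

def naive_factor(avg_len: float) -> float:
    # "Naive" avg-length adjustment: the win rate is multiplied by this factor.
    # It grows without bound as responses get shorter (the Davinci001 problem).
    return BASELINE_LEN / avg_len

def logistic_factor(avg_len: float, max_boost: float = 2.0, k: float = 3.0) -> float:
    # Same idea, but the (log) length ratio is squashed through a logistic,
    # so the boost saturates at max_boost instead of exploding for very
    # short responses, where confidence in "bias vs. quality" is lowest.
    x = math.log(BASELINE_LEN / avg_len)  # 0 when the lengths match
    return 1.0 + (max_boost - 1.0) * (2.0 / (1.0 + math.exp(-k * x)) - 1.0)
```

With these constants, a Davinci001-like model (avg_length 296) gets a naive factor of roughly 6.9x but a logistic factor just under 2x, which is the tapering-off behaviour described above.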
Thanks a lot, @YannDubs and @gblazex, for the detailed analysis. There isn't anything I'd add on the analysis side, so I'd just add my brief "human" opinion, and I'm happy to discuss further. I treat AlpacaEval as a quick and easy-to-interpret auto-eval tool. I like (and propose) balanced_win_rate because it's simple and easy to grasp (like a typical ML macro score, e.g., macro-accuracy), does not involve any constant/magic number, and makes minimal assumptions. I appreciate that it also has the highest correlation with Arena (actual human eval and the current gold labels). Hence, I favor simple metrics that people can interpret in seconds without being introduced to more custom numbers. Caveat: I proposed the balanced win-rate score.
It's a simple metric, but it doesn't solve the top of the leaderboard being gamed by Snorkel, which stays way too high; that doesn't pass the sniff test for me. Scaling by length proportionate to the GPT-4 Turbo baseline makes sense because that's the judge itself (avg-length-adjustment), so it is also a simple metric, and it handles the top of the leaderboard well (most important for public opinion). The only reason it needs more work is the bottom of the leaderboard. The whole leaderboard is GPT-4 Turbo based, so "magic" is already embedded in it. The goal is to make it look closer to Arena: trustable while way cheaper. Otherwise, at best, people are not gonna like it.
As the author of the model you mentioned (Snorkel), I would love to hear instances where you @gblazex (and the public) feel strongly about our model. My email is here. If you want to share thoughts and feedback and discuss further, I am happy to chat more, and I would appreciate it if we could move the discussion to an email thread so as not to distract from the current discussion. The released model is our attempt to propose the iterative DPO approach with reward models (PairRM, which Allen AI released), and not in any way to game the leaderboard (we released the data, code, recipe, everything). Meta's concurrent self-rewarding LLM paper also sees the length effect after DPO iterations. Regarding the leaderboard, I'd leave my last comments here and defer to @YannDubs and others for the final decisions. I stand with balanced_win_rate thanks to its interpretability and simplicity (a macro score). It's a metric; hence, it should be interpretable, and not be overly parameterized and treated like finding a line of best fit to Arena. For average-length metrics, by dividing the scores by length, we are calculating rewards per character, which is based on the assumption that if a model is 10% longer, we expect it to be 10% better. We see this is not the case if we look at those with

My only goal is to propose an alternative I found useful, and I hope it helps and does not show up as biased (toward the model I released) or reflective of anything else. My opinions are my own. Thanks all!
Our bias

First, let's admit our biases. We all have a bias toward our own proposed solution. Even though the idea to boost/penalize models proportionate to their length compared to the baseline didn't come from me, I tweeted about it and have worked a lot more on it since, so I have a bias towards it. @viethoangtranduong, you have the same bias toward your solution, plus some extra bias from having a model in a high position on the leaderboard.

Off topic opinion

Snorkel doesn't show up in a high place on any other leaderboard. It is listed on EQ-Bench and suffered a similar fate to xDan (a model that looks specifically trained to ace MT-Bench, but not much else), with more modest results on EQ, close to OpenOrca, OpenHermes, and NeuralHermes. How good is this EQ-Bench? Spearman correlations: (I only checked overlapping rows

If a model fails to show notable improvements in something as well correlated with humans as EQ, it raises questions. Now, I think PairRM is an amazing technique, and it pushes the boundaries. I think Snorkel is great research too, and hats off for making it this open. But do I think that on a generalist chatbot leaderboard it should shoot a 7B model so high above Mixtral and Gemini Pro? So even if it wasn't meant to, it can still inadvertently expose weaknesses in a leaderboard like AlpacaEval. (Note: Deluxe-v1.2 on Arena might be doing best-of-N and/or picking the best model, as streaming is disabled for that model; it seems to perform well, but ELO results still haven't been published since it was added in October.)

Off topic suggestion (what we don't have)

I think the best would be if Snorkel showed up in Arena. At the end of the day, we don't have enough human data on longer-than-GPT-4-Turbo responses.

On topic comment (on what we do know)
You are gaining "some boost" for shorter responses only to counteract the "lost" points that the judge doesn't give you because of its verbosity bias. Also, the logistic graph showed how a tapering-off can be applied that doesn't give more and more points as you go shorter and shorter, even to counteract bias (because the confidence in bias vs. a genuine quality drop diminishes). The balanced approach suffers from imbalanced sample sizes: many times, less than 10% of responses are in one bucket. This is concerning to me, and if newer models are simply longer than GPT-4 Turbo in even more cases, it'll be even more problematic.
I updated the notebook with the three potential metrics we could be considering (balanced win-rate, logistic length-normalized win-rate, length-controlled win-rate), my thoughts about them, and the metrics I use to compare them. I'm currently thinking of going with the last one but am definitely open to suggestions/improvements/thoughts.

New results: length-controlled win-rate

In statistics, a common way to control for a covariate is to fit a (generalized) linear model conditioned on that covariate. Concretely, in our case, that means predicting with a logistic regression the preference/score of an output while conditioning on the variable we care about (the model) and the one we want to control for (the length). We can then use the logistic regression to essentially predict "what would the win rate be if the length were the same as the baseline's"*. A rough sketch of the idea is below. The results are shown at the end of the notebook, and the metrics are the best so far.
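For intuition, here is a minimal sketch of that idea, fitting the regression on per-example judge preferences and predicting at zero length difference. This is a simplified stand-in (the actual regression, per the discussion below, also conditions on instruction difficulty), and all names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def length_controlled_win_rates(prefs, model_ids, len_deltas):
    """prefs: 1 if the judge preferred the model's output over the baseline's, else 0.
    model_ids: integer model index per example.
    len_deltas: standardized (output length - baseline length) per example."""
    n_models = int(model_ids.max()) + 1
    one_hot = np.eye(n_models)[model_ids]        # condition on the model we care about...
    X = np.column_stack([one_hot, len_deltas])   # ...while controlling for length
    clf = LogisticRegression(fit_intercept=False).fit(X, prefs)
    # Counterfactual: same model, but output length equal to the baseline's.
    X_cf = np.column_stack([one_hot, np.zeros_like(len_deltas)])
    p = clf.predict_proba(X_cf)[:, 1]
    return {m: p[model_ids == m].mean() for m in range(n_models)}
```

Averaging the counterfactual per-example probabilities per model is what makes the number read as "win rate at the baseline's length".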
Here's the top and bottom of the leaderboard. Overall I think this is the best option we currently have, but I do worry about it being more complex and how that will affect its use by the broader NLP/LLM-enthusiast community. Benefits/downsides from the notebook: Please let me know what you think!

* Making causal conclusions requires more care, but that's the intuition of it.

Comments about above

Thanks, @gblazex, for all the work. A few thoughts:
Thanks for sharing your thoughts, @viethoangtranduong. Like you:
Great work again, Yann. On initial viewing, I think this idea is genius and seems to work really well! It didn't require you to use Arena ELO (except maybe to verify whether you need to assume something other than linearity within bins), which sounds great. The only thing that's not intuitive to me is how "instruction difficulty" entered the picture. Were the initial correlations, purely controlling for length, not good enough? In any case, I think this method seems to work well, and I don't worry that much about the communication issue. It's a widely used technique (especially in other fields, as you said), and it's very easy to come up with examples, e.g., looking at smoking's effect on cardiovascular health while controlling for body weight. I would love to spend some time playing with it, but I have no objections if you decide to release it early. I think it's way better than what was available before (the simple win_rate), less gameable than the balanced one, and not as brittle/Arena-specific as adjusting with a fitted regression/shape.
Newcomer here to the thread (and full disclosure: I work with @viethoangtranduong): we don't want to assume that length correlations with win rate are entirely artificial, do we? E.g., some longer responses may truly be better because they contain more useful information (and are longer as a result). It seems like logistic regression is a step in the right direction, but don't we really want to write AlpacaEval win rate = F(human win rate, length) and essentially factor out length with some model like that, fit to samples of paired win rates (alpaca-human pairs)?
1. Length correlation is not removed completely

We are not assuming that length correlation is entirely artificial. Some correlation is "left in there" that approaches human preference. As you can see in the results of the newest method:
where spearman=0.24 and kendall=0.16 are the length correlations with human scores.

2. Longer responses are allowed to beat shorter ones; the unfairness of the scoring shape is only corrected to be fairer
This is true, but all the discussed methods allow for this:
3. The biggest problem, as I noted in my answer, is:

"At the end of the day we don't have enough human data on longer-than-GPT-4-Turbo responses."

Snorkel is one of the only models (other than Yi & InternLM2) producing longer-than-GPT-4-Turbo responses, so if you guys could nudge lmsys about it (especially with the cheap Together endpoint that they have access to), that would be the greatest contribution! Until then, regression and controlling for a covariate are both based on limited data at the longer end of the length spectrum, and some of it will be fitting on only a few datapoints, extrapolating from shorter responses (& intuition). **

** correction in my next comment
There's probably length correlation left behind because it's a relatively simple GLM of a more complex relationship. The fact that a weak correlation may remain doesn't necessarily mean that the component driving that correlation is the same one that drives human judgments. That all said, it sounds like we agree on the need for more human win-rate data to resolve all of these ambiguities; your idea of focusing on the regime with longer-than-GPT-4 responses sounds like a good place to begin. More generally, once we have more human data for direct comparison, I'd strongly recommend that a final length-adjustment model leverage it directly (i.e., let's do our best to ensure the length component left behind maps more directly to whatever drives humans to sometimes prefer longer responses).
Trying to be brief, so as not to add too many comments to Yann's backlog:
One correction to my previous comment: Yann's new covariate method works at a per-question level (not per-model), so I think a lot more data where responses are longer than GPT-4 Turbo's is considered. I feel like the 3rd one is both:
That all makes sense to me, and thanks for taking the time to discuss here. Agreed, it's at least an improvement, given the higher correlation with human prefs on the other benchmarks. We're also discussing internally how we can contribute to the open-source community w.r.t. scalable evaluation methods that align with human prefs with less bias. Very much appreciate the hard work here on AlpacaEval!
Hey everyone! I improved the last method in the notebook (PR: #244, notebook) to have some nice mathematical properties. It's not perfect, but IMO it's good enough to ship. My plan is to do so this weekend! Feel free to drop comments before then and I'll try to incorporate them! Here are the final metrics:

Note that given those nice new properties, we can now do some cool things like predict the win-rate between any two models (length-corrected or not), which means we can also predict ELO ratings (see the sketch at the end of this comment). So @gblazex, you can finally use Pearson correlations if you want! E.g., below

Comments to previous messages:
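To illustrate the ELO point above: a minimal sketch, assuming the fitted per-model logistic-regression coefficients behave like Bradley-Terry abilities relative to the baseline (my reading of the "nice new properties"; the PR may differ):

```python
import math

def predicted_win_rate(theta_a: float, theta_b: float) -> float:
    # Bradley-Terry: win probability from the difference of abilities.
    return 1.0 / (1.0 + math.exp(-(theta_a - theta_b)))

def predicted_elo(theta: float, anchor: float = 1000.0) -> float:
    # Map a natural-log ability onto the ELO scale, where a 400-point
    # gap corresponds to 10:1 odds (hence the 400 / ln(10) scaling).
    return anchor + theta * 400.0 / math.log(10)
```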
Finally done! Thank you everyone, and sorry that it took me so long! PR: #258
Chatted with @YannDubs briefly in an email thread. Moving here for visibility. I'll let Yann add his responses later.
Given that GPT-4 has biases toward length, I propose we add a length-controlled metric: a "macro win rate" that depends on the length of the response.
So we compute the win rate for when responses are longer/shorter than the reference response, and then compute the average (balanced_win_rate); see the sketch below. Let me know what you think, and I can help with a PR if this is useful.
Attached is an image of the top 30 of the leaderboard (see the last column). We don't need to update with new responses or spend any costly compute; we just need to compute the new metric by adding simple functions.
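A minimal sketch of the proposed metric (function and argument names are hypothetical; a real PR would hook into AlpacaEval's annotation dataframes):

```python
import numpy as np

def balanced_win_rate(wins, output_lens, reference_lens):
    """Macro-average the win rate over the two length buckets:
    examples where the response is longer vs. shorter than the reference."""
    wins = np.asarray(wins, dtype=float)
    longer = np.asarray(output_lens) > np.asarray(reference_lens)
    # Note: this degenerates when one bucket is (nearly) empty,
    # the sample-size caveat discussed in the thread above.
    return 0.5 * (wins[longer].mean() + wins[~longer].mean())
```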