Potential length-controlled metric for Alpaca Eval 2.0 #225

Closed
viethoangtranduong opened this issue Feb 2, 2024 · 25 comments

@viethoangtranduong
Contributor

viethoangtranduong commented Feb 2, 2024

Chatted with @YannDubs briefly in an email thread. Moving here for visibility. I'll let Yann add his responses later.

Given that GPT-4 has a bias toward length, I propose adding a length-controlled metric - a "macro win rate" that conditions on the length of the response.

Concretely, we compute the win rate separately for responses that are longer and shorter than the reference response, then average the two (balanced_win_rate). Let me know what you think and I can help with a PR if this is useful.
Attached is an image of the top 30 of the leaderboard (see the last column). We don't need to regenerate responses or run any costly compute; we just need to compute the new metric with a simple function addition.

[Image: alpaca_eval_balance_win_rate]
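For concreteness, here is a minimal sketch of the balanced win rate (my own illustration, not an existing AlpacaEval function), assuming a per-example dataframe with hypothetical columns preference (1 = win over the reference, 0 = loss, 0.5 = tie), output_length, and reference_length:

import pandas as pd

def balanced_win_rate(df: pd.DataFrame) -> float:
    # Split examples by whether the model's output is longer than the reference output.
    longer = df[df["output_length"] > df["reference_length"]]
    shorter = df[df["output_length"] <= df["reference_length"]]
    # Macro average: each length bucket contributes equally, regardless of its size.
    return 100 * (longer["preference"].mean() + shorter["preference"].mean()) / 2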

@mathew-eq4all

I like the idea of a length-controlled metric! I worry that the unweighted average of the long win rate and short win rate might be too skewed for models that generate either mostly long or mostly short sequences though. What do you think?

How about weighting each win/loss with the evaluator's long-short preference?
As an example: $\text{score}_{\text{balanced}}=\frac{\sum_i w(s_i, \hat{p})\, s_i}{\sum_i w(s_i, \hat{p})}$, where $\hat{p}$ is the probability that the evaluator prefers longer completions, $s_i$ is the win-lose-tie score (win=1, lose=0, tie=0.5), and the weight $w(s_i, \hat{p})$ is $1 - \hat{p}$ if the completion is longer than the reference, $\hat{p}$ if shorter, and $0.5$ otherwise.
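A minimal sketch of that weighting, under the assumptions above (the argument names and the p_hat estimate are hypothetical, not part of AlpacaEval):

def weighted_win_rate(scores, length_flags, p_hat):
    # scores: per-example win/lose/tie scores (1 / 0 / 0.5).
    # length_flags: +1 if the completion is longer than the reference, -1 if shorter, 0 otherwise.
    # p_hat: estimated probability that the evaluator prefers the longer completion.
    weights = []
    for flag in length_flags:
        if flag > 0:
            weights.append(1 - p_hat)   # longer completions are down-weighted if the judge favors length
        elif flag < 0:
            weights.append(p_hat)       # shorter completions are up-weighted correspondingly
        else:
            weights.append(0.5)         # same length: neutral weight
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)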

@viethoangtranduong
Contributor Author

viethoangtranduong commented Feb 7, 2024

cc @YannDubs

I know Yann has also been giving this some thought so will let him chime in.

Personally, I'd stay away from multiplying/referencing any "fixed" number like that fixed probability p or any specific model length (besides gpt-4-turbo) but agree that we need to control for the skewness of models that are overly long or short.

@gblazex
Contributor

gblazex commented Feb 7, 2024

I did some work on this. Will publish soon. Stay tuned.

@YannDubs
Collaborator

YannDubs commented Feb 7, 2024

Thanks for opening this thread and for all the ideas 💯 . Been very busy, but I have some work and thoughts that I'll share this week-end!

@YannDubs
Collaborator

YannDubs commented Feb 11, 2024

Here's the notebook where I set up & discuss the metrics/properties that I think are important when deciding which length correction to apply.

I analyzed different options including:

  1. what @viethoangtranduong proposed above (Balanced win rate in the notebook)
  2. what @mathew-eq4all proposed above (Weighted win-rate in the notebook)
  3. what @gblazex shared on twitter (Average length corrected win-rate in the notebook)

The good thing is that they all improve over the current win-rate on all the metrics I'm considering (how much the metric can be gamed by asking for verbosity in the prompt, correlation with Arena, correlation with length).

In terms of metrics, (1) and (3) above perform best. I prefer (1) because the interpretation & range of the metric stay the same as the current AE. The downside is that it will not work well in cases where we have very few examples that are longer or shorter than the baseline. This setting is relatively rare: e.g., if we require at least 40 (5%) examples in each category, that would only filter out 10 models out of 110.

I encourage everyone to check the notebook and run it on new metrics you might be thinking about (e.g. @gblazex I'm not sure which one you are currently considering).

My plan is to add a length-corrected metric next week-end, so please let me know your thoughts before then!

@YannDubs
Collaborator

Here's the summary of metrics for (1), (2), and (3) above, as well as the standard win-rate:

[Screenshots: metric summaries for the four win-rate variants]

@YannDubs YannDubs self-assigned this Feb 11, 2024
@gblazex
Contributor

gblazex commented Feb 11, 2024

hey @YannDubs, GREAT WORK! Happy to contribute

I'm running your analysis to validate my results, but the GH repo is missing some model 'results' folders I believe:

load_annotations fails for
"gpt4_0314", "gpt-3.5-turbo-16k-0613", "gpt-3.5-turbo-1106"

@gblazex
Contributor

gblazex commented Feb 11, 2024

Trying to optimize the length bias to approach the same magnitude as humans have is a brilliant idea. Much better than simply minimizing it.

But similarly, we have no human basis to assume that the scoring difference between verbose/normal/concise versions of the same model should simply be minimized.

It's important not to confuse a lack of value with under-appreciated value.
(short & dull response vs. short & quality response)

Possible next steps:

  • a) Unfortunately, the optimal solution here would be to run Arena-like freestyle human feedback on different model versions, and try to approach their scoring spread. This is hard to execute, unless the lmsys team is open to running it, even for a limited time.

  • b) Keep trying to minimize the spread

  • c) Drop this goal until a better intuition & still doable one is proposed

@gblazex
Contributor

gblazex commented Feb 12, 2024

In the meantime I updated the ELO scores in the repo to have up-to-date & complete correlations.

#233

@YannDubs
Collaborator

Thanks @gblazex I added the missing annotations to GH.

I agree with your point on verbose/normal/concise, i.e., we should be careful when minimizing, since we don't know how humans are impacted by that. That being said:

  1. people are mostly adding models to AE, rather than systems (model + prompt). As a result, people expect the models not to have a huge variance for reasonable prompts. This point is less about human similarity, and more about how the benchmark is used.
  2. I'm ~assuming that the normal version already answers the question (as it was asked to) and thus verbosity shouldn't add too many gains. That's why I separated verbosity & conciseness gameability. For example, right now the win-rate goes from 50% to 64% for gpt4-preview, which is not ideal. Other metrics bring that closer to ~55%.

@gblazex
Contributor

gblazex commented Feb 13, 2024

A somewhat "compressed" update on what I'm working on / looking at

The avg length adjustment method shoots up the adjustment factor as responses get shorter and shorter. This might not be what we want.

Even if it seemingly improves the newly defined "gameability" metric.
(and that is my biggest gripe with that metric right now, that it can lead us to optimize for the wrong thing)

It might fairly improve the under-appreciated concise responses from stronger models (GPT-4, Mixtral, ...), but it'll also unfairly boost the very short responses of less capable models.

Davinci001 is the clearest example: it gets boosted from a 2.76% win rate to a 12.5% win rate, jumping from 98th place all the way up to 27th.

[Image: leaderboard comparison of the adjustment methods]

On the right you can see how a "naive"-avg-length-adjustment would over-boost Davinci001. This doesn't seem to pass the smell test to me.
But a logistic adjustment solves this issue (second column from right).
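For reference, a rough sketch of the "naive" adjustment as I read it (the exact formula here is my assumption): rescale each model's win rate by the ratio of the baseline's average length to the model's average length, so shorter-than-baseline models get a boost and longer ones a penalty.

# lb is assumed to be the leaderboard DataFrame with `win_rate` and `avg_length`
# columns, and BASELINE the index of the gpt-4-turbo row (avg_length ~2049).
baseline_len = lb.loc[BASELINE, "avg_length"]
naive_adj_factor = baseline_len / lb["avg_length"]   # >1 for shorter models, <1 for longer ones
lb["naive_len_adj_win_rate"] = lb["win_rate"] * naive_adj_factor

This factor grows without bound as avg_length shrinks, which is exactly the Davinci001 over-boosting described above; the logistic variant shared later in this thread caps that growth.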

What I'm experimenting with is to dampen this by either:

  • completely switching to a logistic function below the gpt-4 turbo 2049 baseline avg_length (diminishing boosting for shorter and shorter models)
  • or at least use this logistic function to "dampen" the slope of what the original simple division yielded

[Image: adjustment factor curves: naive (blue), logistic (orange), combined (cyan)]

Blue line is the "naive" avg length adjustment factor that gets applied to win-rate.
Orange line is what a logistic one would yield.
Cyan is what a combination of the two could look like (the blue line dampened by the orange one with some weights).

The reason this diminished boosting makes sense to me is that we can be less & less sure that the lower win-rates of shorter & shorter responses are due to bad judgement rather than a genuine lack of quality.

(e.g. Davinci001 avg_length is 296! That's simply not enough data to confidently boost it from 2.76% win rate to 12.5% win rate)

When looking at the top of the leaderboard for the balanced-win-rate and avg-length-adjustment (or logistic-avg-length-adjustment) methods, the latter ones pass the smell test better for me.
Snorkel goes down to where it should be.

So I'm definitely still advocating for the avg-length-adjustment (with a logistic twist).

It also echoes people's sentiment on twitter.

Currently looking at other metrics to quantify this.


Finally here is an earlier comparison I ran between

  • original winrate (left),
  • "naive"-avg-len-adjusted winrate (middle) and
  • "logistic"-len-adjusted-winrate (right)

[Image: leaderboard comparison of the original, naive length-adjusted, and logistic length-adjusted win rates]

Caveats:

  • whole Alpaca Leaderboard, not just ones with ELO
  • so the Spearman correlation with length is not directly comparable to the Colab numbers.
  • not scaled back to 50% baseline win rate

But the overall look feels good to me.

Both adjustments reduce the dependence on length by a fair bit.

The logistic (rightmost) one doesn't even let very short responses shoot above the trendline, where we lack the confidence to justify more boosting.

@viethoangtranduong
Contributor Author

Thanks a lot, @YannDubs and @gblazex, for the detailed analysis. There isn't anything I'd add on the analysis side.

So, I'd just add my brief "human" opinion, and I'm happy to discuss further. I treat AlpacaEval as a quick and easy-to-interpret auto-eval tool. I like (and propose) balanced_win_rate because it's simple and easy to grasp (like a typical ML macro score, e.g. macro-accuracy), does not involve any constant/magic number, and makes minimal assumptions. I appreciate that it also has the highest correlation with Arena (actual human eval and the current gold labels). Hence, I favor simple metrics that people can interpret in seconds without being introduced to more custom numbers.

Caveat: I proposed the balanced win-rate score.

@gblazex
Contributor

gblazex commented Feb 13, 2024

It’s a simple metric but it doesn’t solve the top of the leaderboard being gamed by Snorkel. Stays way too high.

doesn’t pass the sniff test for me

Scaling by length relative to the GPT-4 Turbo baseline makes sense because that's the judge itself (avg-length-adjustment). So it is a simple metric too.

It handles the top of the leaderboard well (most important for public opinion).

The only reason it needs more work is because of the bottom of the leaderboard.

The whole leaderboard is GPT-4 Turbo based so “magic” is already embedded in the leaderboard.
It’ll never be a truly transparent metric.

The goal is to make it look closer to Arena: trustworthy while way cheaper.
With the trust issues of OpenLLM and other gamed leaderboards, leaving Snorkel and perceived gameability unsolved makes it a half measure.

At best, people are not going to like it.
At worst, it'll be used to garner hype around the wrong kind of models.

@viethoangtranduong
Contributor Author

viethoangtranduong commented Feb 14, 2024

As the author of the model you mentioned (Snorkel), I would love to hear instances where you @gblazex (and the public) feel strongly about our model. My email is here. If you want to share thoughts and feedback and discuss further, I am happy to chat more, and I would appreciate it if we could move that discussion to an email thread so as not to distract from the current one. The released model is our attempt to propose an iterative DPO approach with reward models (PairRM, released by Allen AI), and not in any way to game the leaderboard (we released the data, code, recipe, everything). Meta's concurrent self-rewarding LLM paper also sees the length effect after DPO iterations.

Regarding the leaderboard, I'd leave my last comments here and defer to @YannDubs and others for the final decisions.

I stand with balanced_win_rate thanks to its interpretability and simplicity (a macro score). It's a metric; hence, it should be interpretable, and not be overly parameterized or treated like finding a line of best fit to Arena.

For average-length metrics, dividing the score by length amounts to calculating reward per character, which rests on the assumption that if a model is 10% longer, we expect it to be 10% better. We see this is not the case if we look at best-of-16 vs. its base model - so we control for the same model: best-of-16 generates 16 variations and submits the one with the highest score judged by PairRM. If we compare best-of-16 to the base model, SnorkelxPairRM DPO was 5% shorter than its base model yet yielded 15% higher scores; Tulu was 13% longer for a 17% jump, and Yi 3% longer for a 5% jump. Also, I see the correlation between length and win rate, but a confounding factor is that later models are also longer (and they should hopefully be better). Also, if this is the eval board, that length-division metric would incentivize shorter responses in future models, as you'd have some extra points if you are shorter on average.

My only goal is to propose an alternative I found useful, and I hope it helps and does not show up as biased (toward the model I released) or reflective of anything else. My opinions are my own. Thanks all!

@gblazex
Contributor

gblazex commented Feb 16, 2024

I have a lot of tests (and a few opinions) and plan to publish them here.

For now just a short version:
logistic length controlling comes out on top (combining it with avg-length-adjustment is even better).

Columns are the win_rate and adjusted win_rate methods.
Rows are the metrics that measure how close each method is to human preference & bias.

  • spearman_elo ↑: the (adjusted) win_rate's Spearman correlation with ELO
  • kendall_elo ↑: the win_rate's Kendall correlation with ELO
  • mse_elo ↓: mean squared error when using WR to predict ELO (linear)
  • rmse_elo ↓: root mean squared error when using WR to predict ELO (linear)
  • r2_elo ↑: how much of the variance in ELO is explained by WR
  • d_len_bias_spearman ↓: abs(Spearman(len, ELO) - Spearman(len, WR))
    how closely it mimics human length bias, according to Spearman
  • d_len_bias_kendall ↓: abs(Kendall(len, ELO) - Kendall(len, WR))
    how closely it mimics human length bias, according to Kendall
  • d_r2_len ↓: abs(R2(len, ELO) - R2(len, WR))
    how close the R2 with length is to the human R2
  • overall_score ↑: average of the percentile ranks of all the scores above
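As a minimal sketch of how a few of these comparison metrics could be computed (illustrative only; the array names are assumptions), given per-model arrays elo, wr (a win-rate variant), and length (average response length):

from scipy.stats import spearmanr, kendalltau

def compare_to_elo(elo, wr, length):
    spearman_elo = spearmanr(wr, elo).correlation
    kendall_elo = kendalltau(wr, elo).correlation
    # Length-bias gap: how far the metric's length correlation is from the human (ELO) one.
    d_len_bias_spearman = abs(spearmanr(length, elo).correlation - spearmanr(length, wr).correlation)
    d_len_bias_kendall = abs(kendalltau(length, elo).correlation - kendalltau(length, wr).correlation)
    return {
        "spearman_elo": spearman_elo,
        "kendall_elo": kendall_elo,
        "d_len_bias_spearman": d_len_bias_spearman,
        "d_len_bias_kendall": d_len_bias_kendall,
    }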
| Metric | win_rate | balanced | avg_len_adj | log_len_adj | wght_log_len_adj |
|---|---|---|---|---|---|
| ↑spearman_elo | 0.934 | 0.933 | 0.944 | 0.947 | 0.961 |
| ↑kendall_elo | 0.819 | 0.807 | 0.807 | 0.830 | 0.849 |
| ↓mse_elo | 0.090 | 0.030 | 0.040 | 0.048 | 0.045 |
| ↓rmse_elo | 0.300 | 0.172 | 0.200 | 0.219 | 0.213 |
| ↑r2_elo | 0.694 | 0.843 | 0.832 | 0.839 | 0.853 |
| ↓d_len_bias_spearman | 0.173 | 0.156 | 0.174 | 0.116 | 0.015 |
| ↓d_len_bias_kendall | 0.115 | 0.109 | 0.139 | 0.054 | 0.010 |
| ↓d_r2_len | 0.209 | 0.081 | 0.114 | 0.071 | 0.014 |
| ↑overall_score | 0.063 | 0.493 | 0.420 | 0.628 | 0.928 |

Note: I also ran these against Arena GPT-4 Turbo win-rates (instead of ELO) and the conclusions were the same, except there seems to be a bigger length bias in the Arena win-rates than in ELO, and there are fewer samples. I think we can/should keep looking at ELO.

The logistic function, which I hadn't included yet:

import numpy as np

def logistic_twoface(x, x0, k=0.003):
    # L is the "max y" of the logistic branch; the +0.5 offset below makes the
    # two branches meet at the baseline length x0. k controls the steepness.
    L = 1
    if x >= x0:
        # Above the baseline length: penalty is the simple length ratio
        return x0 / x
    else:
        # Below the baseline length: boosting tapers off quickly instead of blowing up
        return -L / (1 + np.exp(-k * (x - x0))) + L + 0.5

# lb is the leaderboard DataFrame with `avg_length` and `win_rate` columns;
# BASELINE is the gpt-4-turbo row.
x0 = lb.loc[BASELINE, "avg_length"]  # 2049
logi_adj_factor = lb["avg_length"].apply(lambda x: logistic_twoface(x, x0, k=0.003))
lb["logi_len_adj_win_rate"] = lb["win_rate"] * logi_adj_factor

@gblazex
Contributor

gblazex commented Feb 16, 2024

Our bias

First, I think we should all admit our biases.

We all have a bias toward our own proposed solution. Even though the idea of boosting/penalizing models proportionately to their length relative to the baseline didn't come from me, I tweeted about it and have worked a lot more on it since, so I have a bias towards it.

@viethoangtranduong, you have the same bias toward your solution, and some extra bias from having a model in a high position on the leaderboard.
I'm not saying it intentionally affects decisions, but it's good to be aware of this.

Off topic opinion

Snorkel doesn't show up in a high position on any other leaderboard.

It is listed on EQ-bench and suffered a similar fate to xDan (a model that looks specifically trained to ace MT-bench, but not much else).
EQ-bench has a very high correlation with human judgement as well. I believe that if Snorkel rises in rank on a "generalist" benchmark like Alpaca, it should rise at least moderately on EQ as well.
That leaderboard is notably zero percent biased towards length (but it's not generalist).

[Screenshots: Snorkel's placement on the EQ-bench leaderboard]

(more modest results on EQ, close to OpenOrca, OpenHermes, NeuralHermes)

How good is this EQ-bench?
Samuel was gracious enough to add a lot of models and create a great overlap with Arena models.
It has one of the highest correlations with Arena ELO:

Spearman Correlations:
EQ-bench v2: 0.863
MT-bench: 0.891
Alpaca v2: 0.899

(I only checked overlapping rows where models have results for all 3 benchmarks.)
We have 31 EQ models matching Arena.

If a model fails to show notable improvements in something as well correlated to humans as EQ, it raises questions.

Now I think PairRM is an amazing technique, and it pushes the boundaries. I think Snorkel is great research too, and hats off to them for making it this open.

But do I think on a generalist chatbot leaderboard it should shoot a 7B model so high above Mixtral and Gemini Pro?
I'm not sure that's realistic.

So even if it wasn't meant to, it can still inadvertently expose weaknesses in a leaderboard like Alpaca.

(note: Deluxe-v1.2 on Arena might be doing best-of-N and/or picking the best model, as streaming is disabled for it; it seems to perform well, but ELO results still haven't been published since it was added in October)

Off topic suggestion (what we don't have)

I think the best would be if Snorkel showed up in Arena.
Now with their together.ai sponsorship & Snorkel being available there (which is great), it's very easy and cheap to add it.
They just finished gathering enough votes for the new models (Qwen 1.5, nous hermes mixtral, code llama 70), so after this round they might be open to a new addition.

At the end of the day we don't have enough human data on longer than GPT-4 Turbo responses.

On topic comment (on what we do know)

that length-division metric would incentivize shorter responses in future models, as you'd have some extra points if you are shorter on average.

You are gaining "some boost" for shorter responses only to counteract the "lost" points that the judge doesn't give you because of its verbosity-bias.
So it's not free points, it's just tilting back the scale to be a more even playing field (with a smaller human-like bias left in there).

Also the logistic graph showed how a tapering off can be applied that doesn't give more and more points as you go shorter and shorter even to counteract bias (because the confidence in the bias vs. quality drop diminishes).

Balanced approach suffers from imbalanced sample sizes and many times less than 10% of responses are in one bucket.

This is concerning to me and also if newer models are simply longer than GPT-4 turbo in even more cases it'll be even more problematic.


@YannDubs
Collaborator

I updated the notebook with the three potential metrics we could be considering (balanced win-rate, logistic length-normalized win-rate, length-controlled win-rate), my thoughts about them, and their metrics. I'm currently thinking of going with the last one, but I'm definitely open to suggestions/improvements/thoughts.

New results: length-controlled win-rate

In statistics, a common way to control for a covariate is to fit a (generalized) linear model conditioned on that covariate. Concretely, in our case, that means using a logistic regression to predict the preference/score of an output while conditioning on the variable we care about (the model) and the one we want to control for (the length). Then we can use the logistic regression to essentially predict "what would the win-rate be if the length were the same as the baseline"*.
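A minimal sketch of that idea (my own illustration of the counterfactual prediction, not the notebook's exact implementation), assuming a per-example dataframe with hypothetical columns model, preference (1 if the model's output is preferred over the baseline's, else 0), and delta_len (output length minus baseline output length):

import pandas as pd
from sklearn.linear_model import LogisticRegression

def length_controlled_win_rates(df: pd.DataFrame) -> pd.Series:
    # Condition on the model (dummies) and on the covariate we want to control (length difference).
    X = pd.get_dummies(df["model"], prefix="model")
    X["delta_len"] = df["delta_len"]
    glm = LogisticRegression(max_iter=1000).fit(X, df["preference"])

    # Counterfactual prediction: same examples, but as if outputs were as long as the baseline's.
    X_same_len = X.copy()
    X_same_len["delta_len"] = 0.0
    p_win = glm.predict_proba(X_same_len)[:, 1]
    return pd.Series(p_win, index=df["model"]).groupby(level=0).mean() * 100

The actual notebook additionally conditions on instruction difficulty and uses a binned length term (the grouped_q20_delta_len in the report below), so this is only meant to convey the counterfactual "same length as the baseline" idea.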

The results are shown at the end of the notebook but the metrics are the best so far.

# Report for **grouped_q20_delta_len + instruction_difficulty**

## Gameability (lower is better)
Verbosity gameability (relative std metric): 9.6%
Conciseness gameability (relative std metric): 14.5%

## Correlation with Arena (higher is better)
Spearman Corr: 0.967
Kendall Corr: 0.849

## Correlation with length (closer to spearman=0.24, kendall=0.16 is better)
Spearman Corr: 0.218
Kendall Corr: 0.135

Here's the top and bottom of the leaderboard.

[Screenshot: top and bottom of the length-controlled leaderboard]

Overall I think this is the best option we currently have, but I do worry about it being more complex and how that will affect its use by the broader NLP/LLM-enthusiast community. Benefits/downsides from the notebook:

[Screenshot: benefits/downsides summary from the notebook]

Please let me know what you think!

* Making causal conclusions requires more care, but that's the intuition behind it.

Comments about above

Thanks @gblazex for all the work. A few thoughts:

  • I also tried a bunch of functional forms for length normalization and also found logistic to be the best (in particular, logistic with a temperature of t=std(length)/2, or t=500 if we don't want a per-model temperature). I agree with your intuition there.
  • like you, I'm worried about the bottom of the leaderboard with length normalization, especially linear length.
  • note that we should really avoid considering metrics like r2/rmse/mse/Pearson with ELO, given that ELO & win-rate are not on the same scale/unit and can't be linearly related, since one is bounded and the other is not.
  • I agree that we should be careful not to overoptimize the length gameability given that we don't know how humans are length gameable. Ideally, Arena would add some models with verbose/concise prompting.
  • I agree that passing the smell test at the top of the leaderboard is important to build trust, although I don't think this can take precedence over all the rest. Many people use AlpacaEval for model development (e.g. hparam tuning / early stopping) rather than model selection, meaning that they use it when win-rates are very small, so we can't have major issues at low win rates & low lengths. Currently, our Arena correlations don't consider models in early development, so we should be careful about that and ideally add worse models.
  • I agree that balanced win-rate will be less useful if we have models that are always very long, which could be an issue down the line.
  • although we have biases I think we all genuinely want the best for AE and the final decision won't be made depending on who proposed what :)

Thanks for sharing your thoughts @viethoangtranduong, like you:

  • I really value simplicity & minimal assumptions, and balanced win-rate is the best length-controlled metric we have in that respect.
  • I'm pretty concerned about the gameability of length normalized win-rate, e.g. with very short lengths.
  • I also don't think we should be overoptimizing for correlation with Arena; we should instead just aim for a good enough correlation. To me, all the proposed metrics achieve that.

@gblazex
Contributor

gblazex commented Feb 21, 2024

Great work again Yann. On initial viewing I think this idea is genius and seems to work really well!

It didn't require you to use Arena ELO (except maybe verifying if you need to assume something other than linearity within bins), which sounds great.

The only thing that's not intuitive to me is how "instruction difficulty" entered the picture. Were the initial results from purely controlling for length not good enough?

In any case I think this method works well and I don't worry that much about the communication issue. It's a widely used technique (especially in other fields, as you said).

It's very easy to come up with examples, e.g. studying smoking's effect on cardiovascular health while controlling for body weight.

I would love to spend some time playing with it, but I have no objections if you decide to release it early. I think it's way better than what was available before (the simple win_rate), less gameable than the balanced one, and not as brittle/Arena-specific as adjusting with a fitted regression/shape.

@cmglaze

cmglaze commented Feb 22, 2024

Newcomer here to the thread (full disclosure: I work with @viethoangtranduong): we don't want to assume that length correlations with win rate are entirely artificial, do we? E.g. some longer responses may truly be better because they contain more useful information (and are longer as a result). Logistic regression seems like a step in the right direction, but don't we really want to write:

AlpacaEval win rate = F(Human win rate, length)

and essentially factor out length with some model like that, fit to samples of paired win rates (alpaca-human pairs)?

@gblazex
Contributor

gblazex commented Feb 22, 2024

1. Length correlation is not removed completely

We are not assuming that length correlation is entirely artificial. Some correlation is "left in there" that approaches human preference. As you can see in the results of the newest method:

Correlation with length (closer to spearman=0.24, kendall=0.16 is better)
Spearman Corr: 0.218
Kendall Corr: 0.135

Where spearman=0.24, kendall=0.16 are the length correlations with human scores

2. Longer responses are allowed to beat shorter ones; the unfair shape of the scoring is only corrected to be more even

some longer responses may truly be better because they contain more useful information

This is true, but all the discussed methods allow for this:

  • We are not forcing the win-rates to always favor longer responses.
  • The goal is to remove the unfair boost that the judge gives to longer responses.
  • The models with longer responses are allowed to be better than shorter ones even after adjustment.
    e.g. the only reason we don't have a model better than GPT-4 Turbo at the top is that GPT-4 Turbo is no. 1 in Arena as well. There simply is no better model.

3. The biggest problem, as I noted in my earlier comment, is:

"At the end of the day we don't have enough human data on longer than GPT-4 Turbo responses."

Snorkel is one of the only models (other than Yi & InternLM2) producing longer-than-GPT-4-Turbo responses, so if you guys could nudge lmsys about it (especially with the cheap Together endpoint they have access to), that would be the greatest contribution!

Until then, both regression and controlling for a covariate are based on limited data at the longer end of the length spectrum, and some of the fit will rely on only a few datapoints, extrapolating from shorter responses (& intuition). **

** correction in my next comment

@cmglaze

cmglaze commented Feb 22, 2024

There's probably length correlation left behind because it's a relatively simple GLM of a more complex relationship. The fact that there may remain a weak correlation doesn't necessarily mean that the component driving that correlation is the same one that drives human judgments.

That all said, it sounds like we agree on the need for more human win-rate data to resolve all of these ambiguities; your idea of focusing on the regime with longer-than-GPT-4 responses sounds like a good place to begin. More generally, once we have more human data for direct comparison, I'd strongly recommend that a final length-adjustment model leverage it directly (i.e. let's do our best to ensure the length component left behind more directly maps to whatever drives humans to sometimes prefer longer responses).

@gblazex
Contributor

gblazex commented Feb 22, 2024

Trying to be short, not to add too many comments to Yann's backlog:

  • You are right: weak correlation numbers alone can't prove that the source of the correlation is the same.
  • But it's useful to remember: the goal is to improve the current system, & the proposed method is much better on all metrics (including human reactions on twitter).
  • It's not perfect, but it's the best we have! Highest correlation to human preference across all benchmarks (including MT-bench, EQ-bench, MMLU, AGIEval, BBH, etc.)
  • Actual proposals & ideas that further improve it are welcome.

One correction to my previous comment: Yann's new covariate method works on a per-question level (not per model) so I think there is a lot more data considered where responses are longer than GPT-4 Turbo.

I feel like the 3rd one is both:

  • a more sophisticated version of the balanced method,
  • and a more extensible (can control for different covariates later) & fine-grained (per-question level) version of the per-model regression one.

@cmglaze

cmglaze commented Feb 23, 2024

That all makes sense to me and thanks for taking the time to discuss here. Agreed it's at least an improvement given the higher correlation with human prefs on the other benchmarks. We're also discussing internally how we can contribute to the open source community wrt scalable evaluation methods that align with human prefs with less bias. Very much appreciate the hard work here on AlpacaEval!

@YannDubs
Collaborator

Hey everyone! I improved the last method in the notebook (PR: #244, notebook) to have some nice mathematical properties. It's not perfect, but IMO it's good enough to ship. My plan is to do so this week-end! Feel free to drop comments before then and I'll try to incorporate them! Here are the final metrics:

[Screenshot: final metrics for the length-controlled win-rate]

Note that given those nice new properties, we can now do some cool things like predicting the win-rate between any two models (length-corrected or not), which means we can also predict ELO ratings. So @gblazex you can finally use Pearson correlations if you want! E.g. below:

[Screenshot: example of predicted win-rates / ELO ratings]
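As a hedged sketch of what such a pairwise prediction could look like (my guess at the idea, not the notebook's actual code): if the fitted GLM gives each model m a quality coefficient theta_m relative to the baseline, the length-controlled win-rate of m over m' can be read off the logistic of the coefficient difference.

import numpy as np

def predicted_win_rate(theta_m: float, theta_m_prime: float) -> float:
    # Assumed form: P(m beats m') = sigmoid(theta_m - theta_m'), expressed in percent.
    return 100.0 / (1.0 + np.exp(-(theta_m - theta_m_prime)))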

Comments on previous messages:

  • I agree with all the recent messages that have been sent. Especially the fact that we most likely don't retain the same length bias as humans. The length-corrected version above is a good improvement but doesn't replace humans.
  • @gblazex concerning "instruction difficulty": it's hopefully a little more interpretable in the new notebook. I think the name is not great, but the intuition is that if its absolute value is large for example_i, it means that the distribution of preferences for that example is more spiky, i.e., there's more often a clear winner.

@YannDubs
Collaborator

YannDubs commented Mar 20, 2024

Finally done, thank you everyone and sorry that it took me so long!

PR: #258
Explanation tweet: https://x.com/yanndubs/status/1770284212937207840?s=20
Paper: https://arxiv.org/abs/2404.04475
