
Is it possible to use existing evaluation files to complete the evaluation? #263

Closed
bittersweet1999 opened this issue Mar 27, 2024 · 12 comments

Comments

@bittersweet1999

bittersweet1999 commented Mar 27, 2024

Thanks for such robust work!
Is it possible to use existing evaluation files to complete the evaluation? For example, I have used my own script to complete the evaluation of GPT-4-1106 under the weighted_alpaca_eval_gpt4_turbo method, and I got each response of GPT-4-1106 like the one below:

[{'finish_reason': 'length', 'index': 0, 'logprobs': {'content': [{'token': 'M', 'bytes': [77], 'logprob': -0.017597131, 'top_logprobs': [{'token': 'M', 'bytes': [77], 'logprob': -0.017597131}, {'token': 'm', 'bytes': [109], 'logprob': -4.048847}, {'token': 'Both', 'bytes': [66, 111, 116, 104], 'logprob': -14.908222}, {'token': 'The', 'bytes': [84, 104, 101], 'logprob': -15.705097}, {'token': 'Based', 'bytes': [66, 97, 115, 101, 100], 'logprob': -15.955097}]}]}, 'message': {'content': 'M', 'role': 'assistant', 'function_call': None, 'tool_calls': None}, 'text': 'M', 'total_tokens': 390.0}]

Can I use some script to directly convert this to the format of annotations.json to obtain the final results?

@YannDubs
Collaborator

YannDubs commented Mar 27, 2024

Not sure how you ended up in that situation, but did you keep track of which instruction and outputs gave that generation?

If so, I'd recommend formatting everything as follows:

[
  {
    "instruction": "instruction",
    "output_1": "output of the baseline",
    "output_2": "output of the model being evaluated",
    "annotator": "weighted_alpaca_eval_gpt4_turbo",
    "preference": null,
    "raw_completion": { the dict you showed above }
  },
  ...
]

Then save (or append to, if it already exists) this file as src/alpaca_eval/evaluators_configs/weighted_alpaca_eval_gpt4_turbo/annotations_seed0_configs.json, which is the caching file. Once you have that, you can run the command alpaca_eval --model_outputs ... --is_reapply_parsing True, which will use the cache but reapply the parsing.

If you did not save the instructions/outputs, then there isn't much you can do, as you won't even know which model M or m refers to.

@bittersweet1999
Author

bittersweet1999 commented Mar 28, 2024

Thanks for your response.
We have added support for AlpacaEval in OpenCompass, which can be used through this config:

https://github.com/open-compass/opencompass/blob/main/configs/eval_subjective_alpacaeval.py

OpenCompass is an evaluation platform that can partition tasks and supports different model inference backends, thereby accelerating the model evaluation process.
By combining the advantages of AlpacaEval and OpenCompass, it is now possible to select a model and perform rapid inference and evaluation in a single step.
Returning to this issue, my intention was to also break down the step of evaluating with gpt4-turbo, so that this step could likewise leverage OpenCompass for partitioning and then directly use AlpacaEval's post-processing.
By the way, we might be able to cooperate more on model evaluation.

@YannDubs
Collaborator

YannDubs commented Apr 2, 2024

Great work @bittersweet1999! I love the idea of incorporating AlpacaEval LC into OpenCompass, and I'm happy to help.

Just to make sure I understand: are you saying that you want a simple script that takes the above JSON and gives you the final preference? Can you depend on alpaca_eval?

@bittersweet1999
Author

Well, thanks for your response!
Yes, in the current implementation I launch alpaca_eval through the command line, so in the current OpenCompass implementation the evaluation stage can only be executed through alpaca_eval. However, I would like to take it a step further: use OpenCompass to perform both the inference and evaluation stages, obtain a JSON like the one above, and finally run alpaca_eval's post-processing from the command line by passing it the path of this JSON to obtain the full set of results.

@bittersweet1999
Author

By the way, I'm also researching length-controlled win rates and position-bias win rates, and I'm wondering if there's a simpler way to calculate the length-controlled win rate. In our current in-house implementation, assuming the reply length of GPT-4 is A and the reply length of the comparative model is B, we score each question as either 0*[1-(B-A)/B] (loss) or 1*[1-(B-A)/B] (win). With this calculation, the win rate of GPT-4 against itself is always 50%. However, if the comparative model's reply is longer than GPT-4's and it still wins the question, its score for that question will be less than 1. Conversely, if its reply is shorter than GPT-4's and it wins, meaning its answer quality is better than GPT-4's, its score will be greater than 1. I'd like to know your thoughts on this calculation method.
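For clarity, here is a rough per-question sketch of the scoring described above (simplified, with illustrative variable names rather than our actual code):

# Rough sketch of the length-adjusted score described above.
# win is 1 if the comparative model beats GPT-4 on this question, otherwise 0.
# len_gpt4 is A (GPT-4's reply length); len_model is B (the comparative model's reply length).
def length_adjusted_score(win: int, len_gpt4: int, len_model: int) -> float:
    length_factor = 1 - (len_model - len_gpt4) / len_model  # [1 - (B - A) / B]
    return win * length_factor

# When GPT-4 is compared against itself, B == A, so the factor is 1 and the average stays at 50%.
# A longer winning reply is credited with less than 1; a shorter winning reply with more than 1.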

@YannDubs
Collaborator

YannDubs commented Apr 2, 2024

For your first question, your current JSON is missing some information. Are you randomizing the order of the outputs when you give them to GPT-4 for evaluation?

Here are the high-level steps of AlpacaEval:

  1. For each instruction, decode the model outputs and add the reference outputs.
  2. Randomize the order of the model and the reference. One becomes M and the other m, but the mapping is random. This is important given that LLM judges typically prefer the last output.
  3. OpenAI's GPT-4 Preview judges its preference by answering with a single token (M or m) with logprobs. Outputting only a single token decreases the eval time and cost and simplifies logprob decoding. Using logprobs improves statistical efficiency and alleviates decoding issues.
  4. Extract the raw preference by taking the probability of the evaluated model's token (say M) normalized by the total probability of M and m (a minimal sketch follows this list). For this step, you need to know how the outputs were randomized, to know which one is M and which is m.
  5. Control for the length bias of the preference by fitting a simple GLM on all the preferences from that model. This takes seconds, even on a single CPU.
  6. Average all the length-controlled preferences over the AlpacaEval set to get the final LC win rate.
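For illustration only, here's a minimal sketch of the normalization in step 4 (a simplified stand-in; the actual parsing is done by alpaca_eval.completion_parsers.logprob_parser, which for historical reasons returns values between 1 and 2, as shown in the script further below):

import math

# Simplified sketch of step 4: normalize the judge's logprobs for the two choice tokens.
# logprob_M and logprob_m are the top logprobs returned for "M" and "m",
# after undoing the randomization of step 2 so that "M" denotes the evaluated model.
def raw_preference(logprob_M: float, logprob_m: float) -> float:
    p_M = math.exp(logprob_M)
    p_m = math.exp(logprob_m)
    return p_M / (p_M + p_m)  # probability mass on the evaluated model

# With the completion shown at the top of this issue:
# raw_preference(-0.017597131, -4.048847) ≈ 0.98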

My understanding is that you currently have steps 1/2/3 and want a script to do 4/5/6, but you don't have enough information stored in your JSON to do step 4. Here's the part of alpaca_eval that randomizes the outputs: https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/processors.py#L53. As long as you add fields in your JSON for instruction/output_1/output_2/is_switch_output_1_output_2, I can write a simple script for the rest.

@YannDubs
Collaborator

YannDubs commented Apr 2, 2024

For your second question: I've tried that and other length-debiasing approaches, but found them less interpretable and less correlated with Arena. See more discussion here: #225. Feel free to ask questions there.

@bittersweet1999
Author

Thanks for your response!
I want to make sure I understand: you mean that I missed randomly swapping the order of the two compared models, right? Actually, we have implemented this feature: when GPT-4 evaluates, the order of the two models it receives is randomly shuffled and has corresponding labels. So for each JSON record, I can produce the following information.
{ "instruction":"", "output_1":"", "generator_1":"", "dataset":"", "output_2":"", "generator_2":"", "datasplit":"", "annotator":"", "raw_completion":{ }, },
The information I am missing is the following three fields:
"preference":1.0000003413, "time_per_example":0.2617537733, "price_per_example":0.00806
So I wonder whether the missing "preference" field will affect the results? If not, the current implementation is enough.

@bittersweet1999
Author

And thanks for your explanation of the length-controlled win rate! I will research it further. Actually, besides randomly swapping the model order for the GPT-4 evaluation, we also implemented a double order: AlpacaEval has 805 questions, so with double order we run 1,610 evaluations, judging each question twice with the model order as both m and M, and we record GPT-4's position bias in the process. I know that AlpacaEval uses random order to reduce position bias, but did you consider using double order, and using it to calculate a new win rate?
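To be concrete, here is a rough sketch of how the two orderings could be combined into one per-question preference (simplified, not our exact code):

# Simplified sketch of the double-order idea described above.
# Each question is judged twice, once with each model in the "M" position.
# Both preferences are first mapped back to "probability that the evaluated model wins".
def double_order_preference(pref_order_a: float, pref_order_b: float) -> float:
    return (pref_order_a + pref_order_b) / 2  # position bias cancels out on average

# For the same question, the gap between pref_order_a and pref_order_b is a direct
# measurement of the judge's position bias, which we record during this process.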

@YannDubs
Collaborator

YannDubs commented Apr 3, 2024

But after applying the LLM judge, do you undo the randomization in raw_completion? I.e., does the token m always refer to output_1, or will it refer to output_2 if the order was switched?
You have to take care of undoing the randomization. This can be done either directly in raw_completion, by changing the tokens in the logprobs, or once you have the preference.

Here's the actual script you are asking for. I'm highlighting the two natural places where you could undo the randomization, in case you haven't.

First, here's the code that builds a dataframe like the one you should have from OpenCompass:

# Imports needed for the snippets below.
import numpy as np
import pandas as pd

import alpaca_eval.completion_parsers
import alpaca_eval.metrics
import alpaca_eval.utils

# Note that the annotations.json did not undo the randomization.
df = pd.read_json("results/claude-3-opus-20240229/weighted_alpaca_eval_gpt4_turbo/annotations.json")
df = df[['instruction', 'output_1', 'generator_1',  'output_2', 'generator_2', 'annotator', 'raw_completion']]
print(df.columns)
# ['instruction', 'output_1', 'generator_1',  'output_2', 'generator_2', 'annotator', 'raw_completion']

# That's the actual randomization that AlpacaEval uses, but feel free to use whatever in your case.
# Store the flag in the dataframe so it can be used for derandomization below.
df["is_switched"] = df.apply(
    lambda x: alpaca_eval.utils.random_seeded_choice(
        seed=f"is_switched_outputs{x['instruction']}0", # some instruction dependent seed
        choices=[False, True],
    ),
    axis=1,
)

# Option 1 for undoing randomization
# This is the derandomization you need if you prefer derandomizing the raw_completion before computing preferences.
# Benefit: can be computed before the preference and will be easier to interpret from the annotations.json
def derandomize_tokens_inplace(x):
    if x is None: return
    # note that we only replace the top logprobs token as this is what `logprob_parser` uses
    for el in x["logprobs"]["content"][0]["top_logprobs"]:
        if el["token"] == "m":
            el["token"] = "M"
        elif el["token"] == "M":
            el["token"] = "m"

for i in range(len(df)):
    if df.iloc[i]["is_switched"]:
        derandomize_tokens_inplace(df.iloc[i]["raw_completion"])

# If you did everything correctly, then df would have the same format as yours. I.e. 
# ['instruction', 'output_1', 'generator_1',  'output_2', 'generator_2', 'annotator', 'raw_completion'] with undone randomization 

Now, here's the script you are looking for:

# Step 4: Extract preference
# Gets the preference of "m" vs "M". This can also be coded in a few lines. For historical reasons it returns values between 1 and 2.
df["preference"] = df["raw_completion"].apply(lambda x: alpaca_eval.completion_parsers.logprob_parser(x, 
                                                            numerator_token="m",
                                                            denominator_tokens=["m", "M"],
                                                            is_binarize=False)[0] 
                                              if x is not None else float("nan"))

# Option 2 for undoing randomization
# This is the derandomization that you need if you apply it after computing the preferences.
# Benefit: simpler when there are many different potential prompts and when caching. This is what AlpacaEval uses.
# Only do the following if you didn't derandomize the raw_completion before. 
# df["preference"] = np.where(df["is_switched"], 3-df["preference"], df["preference"])

# Step 5 & 6: Length control and get result
metrics = alpaca_eval.metrics.get_length_controlled_winrate(df, 
                                                            save_weights_dir=None,
                                                            # adds 'glm_preference' to df
                                                            is_add_glm_preference_inplace=True)
print(metrics)
# {'win_rate': 28.989564293901843,
#  'standard_error': 1.397245743554741,
#  'n_wins': 223,
#  'n_wins_base': 580,
#  'n_draws': 0,
#  'n_total': 803,
#  'discrete_win_rate': 27.770859277708592,
#  'length_controlled_winrate': 40.4779345913862}

# Save df as annotations.json
df.to_json("annotations.json", orient="records", indent=2)

@YannDubs
Collaborator

YannDubs commented Apr 3, 2024

Concerning your second question about double order: we wanted to keep the price and speed low, which is why we randomize. Given the low standard error, I doubt double order will bring much benefit, but if you run both I'd be curious to see how much of a difference it makes.

@bittersweet1999
Author

Thank you very much for the script. I will look into how to integrate it and will keep you updated. Additionally, if there are more comparative results for the double-order vs. random-order experiments, I will update you promptly. Thank you again!
