Is it possible to use existing evaluation files to complete the evaluation? #263
Not sure how you ended up with that situation, but did you keep track of which instruction and output gave each generation? If so, I'd recommend formatting everything as such:
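A minimal sketch of one such record follows; the keys mirror the columns used in the script later in this thread, and the concrete values and model names are placeholders rather than a required format.

# Hypothetical record layout: keep enough information to recover which model produced which output.
record = {
    "instruction": "Explain the difference between a list and a tuple in Python.",
    "output_1": "...",                              # first output as shown to the judge
    "generator_1": "gpt4_1106_preview",             # model that produced output_1
    "output_2": "...",                              # second output as shown to the judge
    "generator_2": "my_model",                      # model that produced output_2
    "annotator": "weighted_alpaca_eval_gpt4_turbo", # which judge configuration was used
    "raw_completion": None,                         # the judge's raw response (with logprobs), if kept
}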
Then save (or append to, if it exists) this file. If you did not save the instructions/outputs, then there isn't much you can do, as you won't even know which model "M" or "m" refers to.
Thanks for your response.
OpenCompass is an evaluation platform that can partition tasks and support different model inference backends, thereby accelerating the model evaluation process.
Great work @bittersweet1999! I love the idea of incorporating AlpacaEval LC into OpenCompass and I'm happy to help. Just to make sure I understand: are you saying that you want a simple script that takes the above JSON and gives you the final preference? Can you depend on
Well, thanks for your response!
By the way, I'm also researching the issues of the length-controlled win rate and the position-bias win rate. I'm wondering if there's a simpler way to calculate the length-controlled win rate. In our current in-house implementation, assuming the reply length of GPT-4 is A and the reply length of the comparative model is B, we calculate the win rate for this question by computing either
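For illustration, one simple variant of this style of length penalty (not necessarily our exact formula, and with placeholder names) could be:

# Illustrative sketch only: down-weight a win for the comparative model when its reply
# (length B) is longer than GPT-4's reply (length A).
def length_penalized_win(win: float, a_len_gpt4: int, b_len_model: int) -> float:
    if b_len_model <= a_len_gpt4:
        return win
    return win * (a_len_gpt4 / b_len_model)

# e.g. a full win (1.0) with a reply twice as long as GPT-4's would count as 0.5
print(length_penalized_win(1.0, a_len_gpt4=200, b_len_model=400))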
For your first question, your current JSON is missing some information. Are you randomizing the outputs when you give them to GPT-4 for eval? Here are the high-level steps of AlpacaEval:
1. Generate your model's outputs for the evaluation instructions.
2. Randomize the order of the model output and the baseline output for each instruction.
3. Query the GPT-4 annotator to get a raw completion for each pair.
4. Extract the preference from the raw completion and undo the randomization.
5. Apply length control.
6. Compute the final win rate.
My understanding is that you currently have steps 1/2/3 and want a script to do 4/5/6, but you don't have enough information stored in your JSON to do step 4. Here's the part of alpaca_eval that randomizes the outputs: https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/processors.py#L53. As long as you add a field in your JSON for instruction/output_1/output_2/is_switch_output_1_output_2, then I can write a simple script for the rest.
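As a self-contained illustration of that flag, here is a seeded switch decision mirroring the randomization used in the script later in this thread; the seed string is copied from that script and the instruction text is just a placeholder.

import alpaca_eval.utils

# Instruction-dependent seeded coin flip: the same instruction always yields the same decision,
# so it can be stored in your JSON as is_switch_output_1_output_2 (or recomputed later).
instruction = "Explain the difference between a list and a tuple in Python."
is_switch_output_1_output_2 = alpaca_eval.utils.random_seeded_choice(
    seed=f"is_switched_outputs{instruction}0",
    choices=[False, True],
)
print(is_switch_output_1_output_2)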
For your second question, I've tried that and other length-debiasing approaches, but found them less interpretable and less correlated with Arena. See more discussion here: #225. Feel free to ask questions there.
Thanks for your response!
And thanks for your explanation of the length-controlled win rate! I will do further research on it. Actually, besides randomly swapping the model order when running a GPT-4 evaluation, we also implemented a
But after applying the LLM judge, do you undo the randomization?

Here's the actual script you are asking for. I'm highlighting the two natural places where you could undo the randomization, in case you haven't done so.

First, here's the code that makes a dataframe like the one you should have from OpenCompass:

import pandas as pd
import numpy as np  # only needed for the commented-out Option 2 below
import alpaca_eval.completion_parsers
import alpaca_eval.metrics
import alpaca_eval.utils

# Note that the annotations.json did not undo the randomization.
df = pd.read_json("results/claude-3-opus-20240229/weighted_alpaca_eval_gpt4_turbo/annotations.json")
df = df[['instruction', 'output_1', 'generator_1', 'output_2', 'generator_2', 'annotator', 'raw_completion']]
print(df.columns)
# ['instruction', 'output_1', 'generator_1', 'output_2', 'generator_2', 'annotator', 'raw_completion']
# That's the actual randomization that AlpacaEval uses, but feel free to use whatever in your case.
df["is_switched"] = df.apply(
lambda x: alpaca_eval.utils.random_seeded_choice(
seed=f"is_switched_outputs{x['instruction']}0", # some instruction dependent seed
choices=[False, True],
),
axis=1,
)
# Option 1 for undoing randomization
# This is the derandomization you need if you prefer derandomizing the raw_completion before computing preferences.
# Benefit: can be computed before the preference and will be easier to interpret from the annotations.json
def derandomize_tokens_inplace(x):
if x is None: return
# note that we only replace the top logprobs token as this is what `logprob_parser` uses
for el in x["logprobs"]["content"][0]["top_logprobs"]:
if el["token"] == "m":
el["token"] = "M"
elif el["token"] == "M":
el["token"] = "m"
for i in range(len(df)):
if df.iloc[i]["is_switched"]:
derandomize_tokens_inplace(df.iloc[i]["raw_completion"])
# If you did everything correctly, then df would have the same format as yours. I.e.
# ['instruction', 'output_1', 'generator_1', 'output_2', 'generator_2', 'annotator', 'raw_completion'] with undone randomization

Now, here's the script you are looking for:

# Step 4: Extract preference
# Gets the preference of "m" vs "M". This can also be coded in a few lines. For historical reasons it returns values in 1 and 2.
df["preference"] = df["raw_completion"].apply(lambda x: alpaca_eval.completion_parsers.logprob_parser(x,
numerator_token="m",
denominator_tokens=["m", "M"],
is_binarize=False)[0]
if x is not None else float("nan"))
# Option 2 for undoing randomization
# This is the derandomization that you need if you apply it after computing the preferences.
# Benefit: simpler when there are many different potential prompts and when caching. This is what AlpacaEval uses.
# Only do the following if you didn't derandomize the raw_completion before.
# df["preference"] = np.where(df["is_switched"], 3-df["preference"], df["preference"])
# Step 5 & 6: Length control and get result
metrics = alpaca_eval.metrics.get_length_controlled_winrate(df,
save_weights_dir=None,
# adds 'glm_preference' to df
is_add_glm_preference_inplace=True)
print(metrics)
# {'win_rate': 28.989564293901843,
# 'standard_error': 1.397245743554741,
# 'n_wins': 223,
# 'n_wins_base': 580,
# 'n_draws': 0,
# 'n_total': 803,
# 'discrete_win_rate': 27.770859277708592,
# 'length_controlled_winrate': 40.4779345913862}
# Save df as annotations.json
df.to_json("annotations.json", orient="records", indent=2)
Concerning your second question with
Thank you very much for the script. I will look into how to integrate it and will keep you updated with any news. Additionally, if there are more comparative results regarding the double and random experiments, I will also update you promptly. Thank you again!
Thanks for such robust work!
Is it possible to use existing evaluation files to complete the evaluation? For example, I have used my own script to complete the evaluation of GPT-4-1106 under the weighted_alpaca_eval_gpt4_turbo method and got each response of GPT-4-1106 like below
Can I use some script to directly convert this to the format of annotations.json to obtain the final results?