-
Notifications
You must be signed in to change notification settings - Fork 267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding WildBench #3150
Adding WildBench #3150
Conversation
8007c43
to
46f7a7c
Compare
90e8d56
to
a9a38d6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe drop this file if we don't need it? Or do you intend to implement a pairwise annotator as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do plan to implement this. However, the original repo contains an ambiguity in the implementation of this metric. I have raised an issue on this in this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically they mentioned a machanism to length-debias in comparison of models, yet the released codebase do not seem to implement it
try: | ||
is_following = instruction.check_following(response) | ||
except Exception as e: | ||
print(f"Instruction following checking failed with error message {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- use
hlog()
instead ofprint()
- start the message with "WARNING: "
- does this fail frequently? this basically means that if the judge model fails, the model under evaluation gets penalized (score defaults to 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding 3., this is very rare and the only type of exception I observed so far was due to langdetect failing to recognize languages, in that case the original codebase consider it as a successful following case, so I followed that for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- and 2. have been updated in the latest change.
- name: wildbench_score | ||
display_name: WildBench Score | ||
short_display_name: WB Score | ||
description: Score of the AI output judged by GPT-4. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GPT-4o?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Changed.
taxonomy: | ||
task: "?" | ||
what: "?" | ||
who: "?" | ||
when: "?" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fill these out (to the best of your knowledge)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in the latest change
dataset = datasets.load_dataset( | ||
"allenai/WildBench", | ||
self.subset, | ||
trust_remote_code=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is trust_remote_code
needed? :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed. This is now removed
baseline_outputs = { | ||
f"{model}": datasets.load_dataset( | ||
"allenai/WildBench-V2-Model-Outputs", | ||
model, | ||
trust_remote_code=True, | ||
cache_dir=cache_dir, | ||
split="train", | ||
) | ||
for model in REFERENCE_MODELS | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like pairwise is half-implemented - I'd suggest finishing the implementation (only the annotator is missing) or removing all the pairwise components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the previous comment on the pairwise metric, I'm waiting on the authors's response, but I can also remove it first to keep the codebase clean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'm fine with keeping it in.
history.append(noun + round["content"]) | ||
history_text = "\n\n".join(history) | ||
user_query_text = row["conversation_input"][-1]["content"] | ||
checklist_text = "\n".join([f"- {checklist_item}" for checklist_item in row["checklist"]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just keep this as a list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good catch, the original code also initially kept the checklist items as a list but they later merged the check list, so I thought we could just store only the processed text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a slight preference for doing the merging in the annotator and keeping this as a list, but either is fine (up to you).
src/helm/clients/vertexai_client.py
Outdated
contents.append( | ||
Content(role=role_mapping.get(msg["role"], "user"), parts=[Part.from_text(msg["content"])]) | ||
) | ||
content_key = "\n".join([msg["content"] for msg in request.messages]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need this - can just rely on the existing content
field (the cache key can be a nested dict).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm I understand it correctly, do you mean we can use the Content
objects as the cache key? But here we have a list of messages, which essentially gives a list of Content
objects, does a list of such nested dicts also work? If it does I'm curious about how that works - any quick pointers would be much appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, basically any nested dict that can be serialized to JSON can be used as a cache key. In terms of the implementation:
- MongoDB natively supports using a JSON object
- Other caches like SQLite serialize the JSON object to a string in a "canonical" way, which happens here:
helm/src/helm/common/key_value_store.py
Lines 9 to 11 in 40b3d23
def request_to_key(request: Mapping) -> str: | |
"""Normalize a `request` into a `key` so that we can hash using it.""" | |
return json.dumps(request, sort_keys=True) |
In terms of this PR, you could do something like
if:
cache_key["messages"] = request.messages
else:
cache_key["prompt"] = request.prompt
or
{
"prompt": request.messages or request.prompt
}
or something similar
) | ||
assert isinstance(dataset, datasets.Dataset) | ||
baseline_outputs = { | ||
f"{model}": datasets.load_dataset( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is quite expensive and also takes up quite a bit of space, I would prefer for the baseline_outputs
to only be loaded if we actually need them i.e. if we're in the pairwise comparison case. You can make this configurable by adding a argument in the constructor (and passing it from the run spec function via ScenarioSpec
args), setting an instance variable, and then checking it here..
Mostly looks good. Feel free to merge after addressing the remaining comments. Also, you need to resolve the conflict with main. |
f42304a
to
b5c7a7a
Compare
Added WildBench scenario, adapter, run specs, annotator, and metrics.
TODO:
Add a customized adapter that applies chat template for model inferenceAlign with original repo on the prompt format for GPT-as-a-judgeComment:
ChatAdapter
to use chatmessages
in theRequest
initialization, but it's most likely optimizable. Suggestions on this would be helpful and are very welcome.