feat: Evaluate missing splits #1525
Conversation
* implement partial evaluation for missing splits
* lint
* requested changes done from scratch
* test for missing split evaluation added
* uncomment test
* lint
* avoid circular import
* use TaskResult
* skip tests for now

Co-authored-by: Isaac Chung <[email protected]>
Overall looks great!
mteb/evaluation/MTEB.py

```python
merged_results = TaskResult(
    dataset_revision=existing_results.dataset_revision,
    task_name=existing_results.task_name,
    mteb_version=existing_results.mteb_version,
```
What should we do if the existing result and the new result have different versions?
One solution is to extend results only if the versions match.
> extend results only if the versions match

This sounds like a natural line in the sand for now, at least for key versions where the results objects are drastically different (e.g. pre-1.11.0). I can open an improvement issue to handle results from different versions?

[edit]: Hmm, doesn't TaskResult.from_disk handle version differences already? There are methods like _convert_from_before_v1_11_0 and checks for pre_v_12_48.
Then maybe use the version from new_results? Because it will dump in the format of the running version.
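Concretely, that suggestion might look like the following sketch. The field names are taken from the excerpt above; the scores merge assumes TaskResult keeps a split-keyed scores mapping, and the exact merge logic here is illustrative, not the final implementation:

```python
# Sketch only: take the version from new_results, since the merged result
# will be dumped in the format of the currently running mteb version.
merged_results = TaskResult(
    dataset_revision=existing_results.dataset_revision,
    task_name=existing_results.task_name,
    mteb_version=new_results.mteb_version,  # version of the running mteb
    # assumed split -> scores mapping; new splits extend the existing ones
    scores={**existing_results.scores, **new_results.scores},
)
```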
Sure, will do.
Separately, I don't think this currently takes differences in dataset version into consideration. This is an existing gap: we only check whether the same model + model revision has been run, but do not check the dataset version. We should probably address it here before merging. wdyt?
Most datasets are downloaded by revision, so I don't think we need more checks.
Ok, then this should be good to merge once the tests pass.
I think it is good to merge. It might be nice to create a "deprecation version", e.g. pre-1.11.0; we could then slowly deprecate old results as needed. However, this is probably something for a more general discussion before any implementation.
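As a sketch of what such a gate might look like (the cut-off constant, function name, and error message here are hypothetical, not part of the codebase):

```python
from packaging.version import Version

# Hypothetical deprecation gate: refuse to reuse results produced before a
# chosen cut-off version, e.g. 1.11.0, where the results object changed
# drastically.
MIN_SUPPORTED_VERSION = Version("1.11.0")

def check_result_version(result) -> None:
    if Version(result.mteb_version) < MIN_SUPPORTED_VERSION:
        raise ValueError(
            f"Results from mteb < {MIN_SUPPORTED_VERSION} are deprecated; "
            "re-run the task to regenerate them."
        )
```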
Looks great, only a few minor things.
Thanks @Samoed and @KennethEnevoldsen for reviewing, and @thivyanth for the initial iteration! Merging now.
Fixes #1260

Evaluating missing splits does not require `overwrite=True` to work.

Example Usage
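(The original example was not preserved in this excerpt; the following is a sketch of the intended workflow, with placeholder model and task names and illustrative split names.)

```python
from sentence_transformers import SentenceTransformer

import mteb

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder model
tasks = mteb.get_tasks(tasks=["NFCorpus"])  # placeholder task
evaluation = mteb.MTEB(tasks=tasks)

# First run: evaluate a single split.
evaluation.run(model, eval_splits=["test"], output_folder="results")

# Second run: request an additional split. Only the missing split is
# evaluated and merged into the existing TaskResult; overwrite=True is
# not needed. (Split names must exist for the chosen task.)
evaluation.run(model, eval_splits=["test", "dev"], output_folder="results")
```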
Checklist
- `make test`
- `make lint`