
feat: Evaluate missing splits #1525

Merged: 7 commits merged into main from eval-missing-splits on Nov 29, 2024

Conversation

@isaac-chung (Collaborator) commented on Nov 28, 2024:

Fixes #1260

  • Based on fix: evaluate missing splits #1268 and addresses most of the comments
  • Once run, the returned results contain both the newly evaluated missing splits and the already-run splits
  • Requires overwrite_results=True to work (see the example below)
  • Changed tests to reuse the same MTEB object instead of creating a new one

Example Usage

from mteb import MTEB, get_tasks, get_model

tasks = get_tasks(tasks=["Banking77Classification"])
model = get_model("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Any local folder works; the original example used a pytest tmp_path fixture.
output_folder = "results/testcase2"

# First run: evaluates only the "val" split.
evaluation = MTEB(tasks=tasks)
results = evaluation.run(
    model,
    eval_splits=["val"],
    output_folder=output_folder,
    verbosity=2,
)

# Second run: evaluates the missing "test" split and returns results for both splits.
results2 = evaluation.run(
    model,
    eval_splits=["val", "test"],
    output_folder=output_folder,
    verbosity=2,
    overwrite_results=True,
)
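
To confirm that the second run indeed returns both splits, a quick check could look like the sketch below (an illustration only; it assumes each returned TaskResult exposes its per-split scores via a scores dict keyed by split name, which may differ across mteb versions):

for task_result in results2:
    # Assumed: task_result.scores maps split names to lists of score dicts.
    print(task_result.task_name, sorted(task_result.scores.keys()))
    # Expected to list both "test" and "val".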

Checklist

  • Run the tests locally with make test to make sure nothing is broken.
  • Run the formatter with make lint.

thivyanth and others added 3 commits November 27, 2024 01:38
* implement partial evaluation for missing splits

* lint

* requested changes done from scratch

* test for missing split evaluation added

* uncomment test

* lint

* avoid circular import

* use TaskResult

* skip tests for now

---------

Co-authored-by: Isaac Chung <[email protected]>
@isaac-chung changed the title from "feat: Eval missing splits" to "feat: Evaluate missing splits" on Nov 28, 2024
@Samoed (Collaborator) left a comment:

Overall looks great!

mteb/evaluation/MTEB.py (review thread, outdated, resolved)
merged_results = TaskResult(
    dataset_revision=existing_results.dataset_revision,
    task_name=existing_results.task_name,
    mteb_version=existing_results.mteb_version,
Collaborator:

What should we do if the existing result and the new result have different versions?

Contributor:

One solution is to only extend results if the versions match.

@isaac-chung (Collaborator, Author) Nov 29, 2024:

> only to extend results if the versions match.

This sounds like a natural line in the sand for now, at least for key versions where the results objects are drastically different (e.g. pre-1.11.0). I can open an improvement issue to handle results from different versions?

[edit]: Hmm, doesn't TaskResult.from_disk handle version differences already? There are methods like _convert_from_before_v1_11_0 and checks for pre_v_12_48.

@Samoed (Collaborator) Nov 29, 2024:

Then maybe use the version from new_results? Because the merged result will be dumped in the format of the currently running version.
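
(For illustration, a minimal sketch of that suggestion: merging the existing and newly evaluated splits while taking mteb_version from the new results. Field names other than those visible in the diff above are assumptions, not necessarily the PR's actual code.)

# Sketch only: merge already-evaluated splits with newly evaluated ones.
merged_results = TaskResult(
    dataset_revision=existing_results.dataset_revision,
    task_name=existing_results.task_name,
    mteb_version=new_results.mteb_version,  # version of the currently running mteb
    scores={**existing_results.scores, **new_results.scores},  # assumed: scores keyed by split name
    evaluation_time=existing_results.evaluation_time + new_results.evaluation_time,  # assumed numeric field
)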

Collaborator (Author):

Sure, will do.

Separately, I don't think this currently takes differences in dataset version into consideration. I think this is an existing gap: we only check whether the same model + model revision has been run, but we don't check the dataset version. We should probably address it here before merging. wdyt?
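
(A sketch of what such a dataset-revision check could look like, purely for illustration; it was not added in this PR, and the attribute paths below are assumptions:)

# Hypothetical guard: only reuse cached splits if the dataset revision matches.
existing_revision = existing_results.dataset_revision
current_revision = task.metadata.dataset.get("revision")  # assumed attribute path
if existing_revision != current_revision:
    # Treat every split as missing and re-run the task from scratch.
    missing_splits = list(task.metadata.eval_splits)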

Collaborator:

Most datasets are downloaded by a pinned revision, so I don't think we need more checks.

Collaborator (Author):

Ok, then this should be good to merge once the tests pass.

Contributor:

I think it is good to merge. It might be nice to create a "deprecation version", e.g. pre-1.11.0; we could then gradually phase out old results as needed. However, this is probably something for a more general discussion before any implementation.

@KennethEnevoldsen (Contributor) left a comment:

Looks great, only a few minor things.

mteb/evaluation/MTEB.py (review thread, outdated, resolved)
mteb/evaluation/MTEB.py (review thread, resolved)
@isaac-chung (Collaborator, Author):

Thanks @Samoed and @KennethEnevoldsen for reviewing, and @thivyanth for the initial iteration! Merging now.

@isaac-chung merged commit 8e12250 into main on Nov 29, 2024
10 checks passed
@isaac-chung deleted the eval-missing-splits branch on November 29, 2024, 13:06

Successfully merging this pull request may close these issues.

Only skip benchmarking if split results are the same too