Add descriptive stats to almost all datasets #1466

Samoed · 2024-11-16T19:57:39Z

Checklist

Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.

I ran almost all tasks, except for those with issues mentioned in #1465 and some large datasets that couldn’t fit on my laptop. I’ll try running those on Kaggle. Additionally, I fixed a few bugs related to dataset loading.

I don’t have statistics for the larger tasks because they exceed the RAM capacity of my laptop, and Kaggle's disk space is insufficient. I’ll need some assistance to run them.

Tasks without stats:

Script to run these tasks

import mteb
from tqdm import tqdm

# Tasks that still have no descriptive statistics
for task_name in tqdm(
    [
        "HotpotQA",
        "MSMARCO",
        "MSMARCOv2",
        "MIRACLRetrieval",
        "MrTidyRetrieval",
        "TopiOCQA",
        "MultiLongDocRetrieval",
        "NeuCLIR2022Retrieval",
        "NeuCLIR2023Retrieval",
        "MultiEURLEXMultilabelClassification",
        "SwissJudgementClassification",
        "BibleNLPBitextMining",
        "FloresBitextMining",
        "MIRACLReranking",
        "MindSmallReranking",
        "VoyageMMarcoReranking",
        "WebLINXCandidatesReranking",
    ]
):
    # Load the task by name and compute its descriptive statistics
    task = mteb.get_task(task_name)
    stat = task.calculate_metadata_metrics()
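
Since some of these tasks may fail partway through (for example when a dataset does not fit in RAM), a possible variant of the script above keeps going after a failure and records which tasks succeeded. This is only a sketch and not part of the PR: `run_tasks` and `compute_stats` are hypothetical helper names, and it assumes the failure surfaces as a Python exception (a process killed by the operating system for using too much memory would not be caught).

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def run_tasks(
    task_names: Iterable[str],
    compute: Callable[[str], Any],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Call `compute` for each task name, collecting results and failures separately."""
    results: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    for name in task_names:
        try:
            results[name] = compute(name)
        except Exception as exc:  # MemoryError is a subclass of Exception
            # Record the error and move on to the next task
            failures[name] = f"{type(exc).__name__}: {exc}"
    return results, failures


def compute_stats(task_name: str) -> Any:
    # Same two calls as the script above; mteb is imported lazily so that
    # run_tasks itself can be used without mteb installed.
    import mteb

    return mteb.get_task(task_name).calculate_metadata_metrics()
```

Calling `run_tasks(["HotpotQA", "MSMARCO"], compute_stats)` would return one dictionary of computed stats and one dictionary mapping each failed task name to its error message, which makes it easier to see which tasks still need a larger machine.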

KennethEnevoldsen · 2024-11-18T09:55:03Z

@imenelydiaker just so that you know this exists

KennethEnevoldsen

Feel free to create an issue on this and merge in the current stats

Samoed · 2024-11-18T10:03:57Z

I will downgrade the datasets version and run the tasks that were broken.

Samoed added 6 commits November 16, 2024 22:11

run tasks

f11ad28

remove test script

2155427

lint

db3f290

remove cache

63d082a

fix sickbrsts

36cdb54

fix tests

88c38d1

KennethEnevoldsen approved these changes Nov 18, 2024


add datasets

9208f84

Samoed marked this pull request as ready for review November 18, 2024 10:58

Samoed merged commit 0e9b6fd into v2.0.0 Nov 18, 2024
10 checks passed

Samoed deleted the add_descriptive_stats branch November 18, 2024 11:19
