
Add descriptive_stats to all tasks #1475

Closed
Samoed opened this issue Nov 19, 2024 · 12 comments
Labels: help wanted (Extra attention is needed)

Samoed (Collaborator) commented Nov 19, 2024

I've added descriptive statistics to almost all tasks, but I need help computing them for the remaining ones. To do this, you can run the script below from the v2.0.0 branch:

```python
import mteb
from tqdm import tqdm

# Tasks whose descriptive stats still need to be computed
for task_name in tqdm(
    [
        "FEVER",
        "HotpotQA",
        "MSMARCO",
        "MSMARCOv2",
        "TopiOCQA",
        "MIRACLRetrieval",
        "MrTidyRetrieval",
        "BrightRetrieval",
        "MultiLongDocRetrieval",
        "NeuCLIR2022Retrieval",
        "NeuCLIR2023Retrieval",
        "BibleNLPBitextMining",
        "FloresBitextMining",
        "SwissJudgementClassification",
        "MultiEURLEXMultilabelClassification",
        "MindSmallReranking",
        "WebLINXCandidatesReranking",
        "VoyageMMarcoReranking",
        "MIRACLReranking",
    ]
):
    task = mteb.get_task(task_name)
    stat = task.calculate_metadata_metrics()
```

cc @imenelydiaker

Samoed added the help wanted label on Nov 19, 2024
imenelydiaker (Contributor) commented Nov 19, 2024

Hi @Samoed, thanks for all the efforts you put in! I'll run what's missing 🙂

imenelydiaker self-assigned this on Nov 19, 2024
imenelydiaker (Contributor) commented Nov 19, 2024

@Samoed question: do you know why we don't have the number of qrels for retrieval datasets?

e.g., for MIRACLRetrievalHardNegatives, we have num_queries and num_documents but not the number of qrels. We could maybe infer an average from average_relevant_docs_per_query; any idea why we don't have the exact number?

```python
{
 'number_of_characters': 983901912,
 'num_samples': 2460458,
 'num_queries': 11076,
 'num_documents': 2449382,
 'min_document_length': 5,
 'average_document_length': 0.1694358005407078,
 'max_document_length': 176,
 'unique_documents': 2449382,
 'min_query_length': 1,
 'average_query_length': 88794.41124954858,
 'max_query_length': 48538,
 'unique_queries': 11076,
 'min_relevant_docs_per_query': 1,
 'average_relevant_docs_per_query': 2.3643011917659806,
 'max_relevant_docs_per_query': 20,
 'unique_relevant_docs': 98836,
 'num_instructions': None,
 'min_instruction_length': None,
 'average_instruction_length': None,
 'max_instruction_length': None,
 'unique_instructions': None,
 'min_top_ranked_per_query': None,
...
}
```
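As a quick sanity check (not the author's code), the missing qrel count can be approximated from the averages already reported, since average_relevant_docs_per_query is total qrels divided by num_queries:

```python
# Recover an approximate qrel count from the reported stats:
# total qrels ≈ average_relevant_docs_per_query * num_queries.
average_relevant_docs_per_query = 2.3643011917659806
num_queries = 11076

approx_num_qrels = round(average_relevant_docs_per_query * num_queries)
print(approx_num_qrels)  # 26187
```

This only recovers the total, not per-query detail, which is why storing the exact field is still worthwhile.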

Samoed (Collaborator, Author) commented Nov 19, 2024

I forgot to add them 😅. Can you add the missing fields? I can recompute the rest of the tasks.
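For reference, a minimal sketch of how the missing qrel fields could be derived, assuming qrels is a `{query_id: {doc_id: relevance}}` mapping (names are illustrative, not necessarily MTEB's internal API):

```python
def qrel_stats(qrels: dict[str, dict[str, int]]) -> dict:
    """Illustrative sketch (not MTEB's actual code): derive qrel
    fields from a {query_id: {doc_id: relevance}} mapping."""
    per_query = [len(docs) for docs in qrels.values()]
    num_qrels = sum(per_query)
    return {
        "num_qrels": num_qrels,
        "min_relevant_docs_per_query": min(per_query),
        "average_relevant_docs_per_query": num_qrels / len(per_query),
        "max_relevant_docs_per_query": max(per_query),
    }
```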

dokato (Collaborator) commented Nov 19, 2024

@imenelydiaker how is this one going? do you need any help with that?

imenelydiaker (Contributor) commented

> @imenelydiaker how is this one going? do you need any help with that?

I launched it an hour ago and it's still running; I'm at MSMARCOv2, and the datasets are quite huge. Maybe we can split the work? Do you think you could run the datasets starting from NeuCLIR*?

imenelydiaker (Contributor) commented

@dokato if you still want to run some datasets, you'll need the code I added here to get the number of qrels.

dokato (Collaborator) commented Nov 19, 2024

So I just check out the branch above and run the script from NeuCLIR onward? Sure, I can do that!

imenelydiaker (Contributor) commented

> So just checkout to this branch above and run the script above from Neuclir? Sure, I can do that!

Yes, but it's better if you create a new branch from it so we won't have conflicts (I'm still working on it). Open the PR directly against the v2.0.0 branch, thank you! 🙂

dokato (Collaborator) commented Nov 21, 2024

@imenelydiaker @Samoed
In this PR I added 7/10 of the datasets from my part (starting from NeuCLIR2022Retrieval). Sadly, the datasets below failed repeatedly due to lack of memory:

  • "NeuCLIR2022Retrieval",
  • "NeuCLIR2023Retrieval",
  • "FloresBitextMining".

Currently I only have access to a D8s_v3 VM with 32 GB of RAM, 8 CPUs, and a 120 GB SSD, so I'd say not too bad. Maybe some optimizations are needed, or an even beefier machine. I'll be OOO for the next 4 days, but I can take another look at those if you have some pointers.

Samoed (Collaborator, Author) commented Nov 21, 2024

I can try to run them on my work machine.

imenelydiaker (Contributor) commented

@dokato thank you so much! @Samoed let me know if you can't run them 🙂

Samoed (Collaborator, Author) commented Nov 23, 2024

I found a bug in the statistics calculation (I accidentally swapped docs and queries when computing lengths) and will recalculate the incorrect stats.
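To illustrate the kind of swap described (identifiers are hypothetical, not the actual MTEB code), which would also explain the suspiciously small average_document_length in the stats posted earlier:

```python
queries = ["what is mteb?"]
documents = ["MTEB is a massive benchmark for text embeddings."]

# Buggy: document stats accidentally computed over the queries
buggy_avg_document_length = sum(len(q) for q in queries) / len(queries)

# Fixed: each collection's lengths computed over itself
avg_document_length = sum(len(d) for d in documents) / len(documents)
avg_query_length = sum(len(q) for q in queries) / len(queries)
```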
