
Add descriptive_stats to all tasks #1475

Closed
Samoed opened this issue Nov 19, 2024 · 12 comments
Labels: help wanted (Extra attention is needed)

Samoed (Collaborator) commented Nov 19, 2024

I've added descriptive statistics to almost all tasks, but I need help computing them for the remaining ones. To do this, you can run the script below from the v2.0.0 branch:

```python
import mteb
from tqdm import tqdm

# Tasks whose descriptive stats still need to be computed
for task_name in tqdm(
    [
        "FEVER",
        "HotpotQA",
        "MSMARCO",
        "MSMARCOv2",
        "TopiOCQA",
        "MIRACLRetrieval",
        "MrTidyRetrieval",
        "BrightRetrieval",
        "MultiLongDocRetrieval",
        "NeuCLIR2022Retrieval",
        "NeuCLIR2023Retrieval",
        "BibleNLPBitextMining",
        "FloresBitextMining",
        "SwissJudgementClassification",
        "MultiEURLEXMultilabelClassification",
        "MindSmallReranking",
        "WebLINXCandidatesReranking",
        "VoyageMMarcoReranking",
        "MIRACLReranking",
    ]
):
    task = mteb.get_task(task_name)
    stat = task.calculate_metadata_metrics()
```

cc @imenelydiaker

Samoed added the help wanted label on Nov 19, 2024
imenelydiaker (Contributor) commented Nov 19, 2024

Hi @Samoed, thanks for all the efforts you put in! I'll run what's missing 🙂

imenelydiaker self-assigned this on Nov 19, 2024
imenelydiaker (Contributor) commented Nov 19, 2024

@Samoed question: do you know why we don't have the number of qrels for retrieval datasets?

e.g., for MIRACLRetrievalHardNegatives, we have num_queries and num_documents but not the number of qrels. We could maybe infer an average from average_relevant_docs_per_query; any idea why we don't have the exact number?

```python
{
 'number_of_characters': 983901912,
 'num_samples': 2460458,
 'num_queries': 11076,
 'num_documents': 2449382,
 'min_document_length': 5,
 'average_document_length': 0.1694358005407078,
 'max_document_length': 176,
 'unique_documents': 2449382,
 'min_query_length': 1,
 'average_query_length': 88794.41124954858,
 'max_query_length': 48538,
 'unique_queries': 11076,
 'min_relevant_docs_per_query': 1,
 'average_relevant_docs_per_query': 2.3643011917659806,
 'max_relevant_docs_per_query': 20,
 'unique_relevant_docs': 98836,
 'num_instructions': None,
 'min_instruction_length': None,
 'average_instruction_length': None,
 'max_instruction_length': None,
 'unique_instructions': None,
 'min_top_ranked_per_query': None,
...
}
```
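As a quick sanity check (not the author's code), the missing qrel count can be approximated from the averages already reported, since average_relevant_docs_per_query is total qrels divided by num_queries:

```python
# Recover an approximate qrel count from the reported stats:
# total qrels ≈ average_relevant_docs_per_query * num_queries.
average_relevant_docs_per_query = 2.3643011917659806
num_queries = 11076

approx_num_qrels = round(average_relevant_docs_per_query * num_queries)
print(approx_num_qrels)  # 26187
```

This only recovers the total, not per-query detail, which is why storing the exact field is still worthwhile.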

Samoed (Collaborator, Author) commented Nov 19, 2024

I forgot to add them 😅. Can you add the missing fields? I can recompute the rest of the tasks.
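For reference, a minimal sketch of how the missing qrel fields could be derived, assuming qrels is a `{query_id: {doc_id: relevance}}` mapping (names are illustrative, not necessarily MTEB's internal API):

```python
def qrel_stats(qrels: dict[str, dict[str, int]]) -> dict:
    """Illustrative sketch (not MTEB's actual code): derive qrel
    fields from a {query_id: {doc_id: relevance}} mapping."""
    per_query = [len(docs) for docs in qrels.values()]
    num_qrels = sum(per_query)
    return {
        "num_qrels": num_qrels,
        "min_relevant_docs_per_query": min(per_query),
        "average_relevant_docs_per_query": num_qrels / len(per_query),
        "max_relevant_docs_per_query": max(per_query),
    }
```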

dokato (Collaborator) commented Nov 19, 2024

@imenelydiaker how is this one going? do you need any help with that?

imenelydiaker (Contributor) commented

> @imenelydiaker how is this one going? do you need any help with that?

I launched it an hour ago and it's still running; I'm at MSMARCOv2, and the datasets are quite huge. Maybe we can split the work? Do you think you could run the datasets starting from NeuCLIR*?

imenelydiaker (Contributor) commented

@dokato if you still want to run some datasets, you'll need the code I added here to get the number of qrels.

dokato (Collaborator) commented Nov 19, 2024

So I just check out the branch above and run the script from NeuCLIR onward? Sure, I can do that!

imenelydiaker (Contributor) commented

> So just checkout to this branch above and run the script above from Neuclir? Sure, I can do that!

Yes, but it's better if you create a new branch from it so we won't have conflicts (I'm still working on it). Open the PR directly against the v2.0.0 branch, thank you! 🙂

dokato (Collaborator) commented Nov 21, 2024

@imenelydiaker @Samoed
In this PR I added 7/10 of the datasets from my part (starting from NeuCLIR2022Retrieval). Sadly, the datasets below failed repeatedly due to lack of memory:

  • "NeuCLIR2022Retrieval",
  • "NeuCLIR2023Retrieval",
  • "FloresBitextMining".

Currently I only have access to a D8s_v3 VM with 32 GB of RAM, 8 CPUs, and a 120 GB SSD, so I'd say not too bad. Maybe some optimizations are needed, or an even beefier machine. I'll be OOO for the next 4 days, but I can take another look at those if you have some pointers.

Samoed (Collaborator, Author) commented Nov 21, 2024

I can try to run them on my work machine.

imenelydiaker (Contributor) commented

@dokato thank you so much! @Samoed let me know if you can't run them 🙂

Samoed (Collaborator, Author) commented Nov 23, 2024

I found a bug in the statistics calculation (I accidentally swapped docs and queries when computing lengths) and will recalculate the incorrect stats.
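To illustrate the kind of swap described (identifiers are hypothetical, not the actual MTEB code), which would also explain the suspiciously small average_document_length in the stats posted earlier:

```python
queries = ["what is mteb?"]
documents = ["MTEB is a massive benchmark for text embeddings."]

# Buggy: document stats accidentally computed over the queries
buggy_avg_document_length = sum(len(q) for q in queries) / len(queries)

# Fixed: each collection's lengths computed over itself
avg_document_length = sum(len(d) for d in documents) / len(documents)
avg_query_length = sum(len(q) for q in queries) / len(queries)
```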
