Add giga embeddings #1741


Collaborator

@Samoed Samoed commented Jan 9, 2025

Added InstructSentenceTransformerWrapper to use SentenceTransformer models with instructions.
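
To illustrate what "with instructions" means here, the following is a minimal sketch of prepending a formatted instruction to each input before encoding. It is not the wrapper's actual code: the function name `add_instruction` and the `INSTRUCTION_TEMPLATE` string are hypothetical, and the real model may expect a different format.

```python
from typing import List, Optional

# Hypothetical template, assumed for illustration only; the model's
# real instruction format may differ.
INSTRUCTION_TEMPLATE = "Instruct: {instruction}\nQuery: "


def add_instruction(texts: List[str], instruction: Optional[str]) -> List[str]:
    """Prepend a formatted instruction to each text.

    When instruction is None, the texts are returned unchanged.
    """
    if instruction is None:
        return list(texts)
    prefix = INSTRUCTION_TEMPLATE.format(instruction=instruction)
    return [prefix + text for text in texts]
```

The resulting strings would then be passed to the underlying model's encode call in place of the raw texts.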

Ref embeddings-benchmark/results#77
@ekolodin My results are a bit higher. Could you rerun your results using this implementation, or provide your implementation? My code for run

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.

| Task | Leaderboard | PR |
|---|---|---|
| AmazonCounterfactualClassification | 90.31 | 94.1493 |
| EmotionClassification | 73.1 | 92.075 |
| ToxicConversationsClassification | 75.37 | 90.1123 |
| SprintDuplicateQuestions | 86.3 | 93.487 |
| TwitterSemEval2015 | 63.42 | 65.8234 |
| SciDocsRR | 88.01 | 84.5092 |
| AskUbuntuDupQuestions | 58.19 | 61.41 |
| SCIDOCS | 19.16 | 20.056 |
| SciFact | 72.9 | 67.707 |
| STS16 | 81.09 | 79.6737 |
| STSBenchmark | 82.2 | 78.9945 |
| SummEval | 27.86 | 30.9884 |
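
To make the comparison easier to read, here is a small sketch that computes the per-task difference (PR minus leaderboard) from the numbers in the table. The scores are copied from the table; the helper name `score_deltas` is only illustrative.

```python
# Scores copied from the table above: task -> (leaderboard, this PR).
SCORES = {
    "AmazonCounterfactualClassification": (90.31, 94.1493),
    "EmotionClassification": (73.1, 92.075),
    "ToxicConversationsClassification": (75.37, 90.1123),
    "SprintDuplicateQuestions": (86.3, 93.487),
    "TwitterSemEval2015": (63.42, 65.8234),
    "SciDocsRR": (88.01, 84.5092),
    "AskUbuntuDupQuestions": (58.19, 61.41),
    "SCIDOCS": (19.16, 20.056),
    "SciFact": (72.9, 67.707),
    "STS16": (81.09, 79.6737),
    "STSBenchmark": (82.2, 78.9945),
    "SummEval": (27.86, 30.9884),
}


def score_deltas(scores):
    """Return task -> (PR score - leaderboard score), rounded to 4 decimals."""
    return {task: round(pr - lb, 4) for task, (lb, pr) in scores.items()}


deltas = score_deltas(SCORES)
higher = sorted(task for task, d in deltas.items() if d > 0)
lower = sorted(task for task, d in deltas.items() if d < 0)

# Print tasks from the largest gain to the largest drop.
for task, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:<36} {delta:+.4f}")
```

With these numbers, eight tasks come out higher in the PR run and four (SciDocsRR, SciFact, STS16, STSBenchmark) come out lower.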

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

I would appreciate adding the metadata, but otherwise it looks good. Of course, let's wait until we have looked at the differences in scores.

Comment on lines +268 to +269
use_instructions=True,
)
Contributor

Can we add the training data annotation as well? (We are going through the models and adding that.)

See #1561

Collaborator Author

They haven't published a report yet, so I don't know anything about the training dataset.

Contributor

just do:

    public_training_data=False,  # no report public yet
    public_training_code=False, 
    training_datasets=None, 

Comment on lines +127 to +133
    # to passage prompts won't be applied to passages
    if (
        not self.apply_instruction_to_passages
        and prompt_type == PromptType.passage
        and task.metadata.type == "s2p"
    ):
        instruction = None
Contributor

Why?

Collaborator Author

Similar to jasper and nv-embed, this model doesn't use a prompt for passages. I think it would be helpful to add this to the base class.
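
As a rough, self-contained sketch of that rule (not the code from this PR): the instruction is dropped only when the wrapper is configured not to instruct passages, the input is a passage, and the task type is "s2p". The `PromptType` enum below is a stand-in with assumed values, and `resolve_instruction` with its `task_type` argument is a hypothetical helper replacing `task.metadata.type` from the diff.

```python
from enum import Enum
from typing import Optional


class PromptType(str, Enum):
    # Stand-in for the PromptType used in the diff; values are assumed.
    query = "query"
    passage = "passage"


def resolve_instruction(
    instruction: Optional[str],
    prompt_type: Optional[PromptType],
    task_type: str,
    apply_instruction_to_passages: bool = True,
) -> Optional[str]:
    """Return the instruction to use, or None to encode the raw text."""
    if (
        not apply_instruction_to_passages
        and prompt_type == PromptType.passage
        and task_type == "s2p"
    ):
        # Passages in sentence-to-passage tasks are encoded without a prompt.
        return None
    return instruction
```

Queries keep their instruction in every case, and passages keep it as well unless all three conditions hold.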

Contributor

Got it. Let's do that in a separate PR.
