feat: Integrating ChemTEB #1708
- Move `PubChemSMILESBitextMining` to `eng` folder
- Add citations for tasks involving SDS, NQ, Hotpot, PubChem data
- Update Clustering tasks `category`
- Change `main_score` for `PubChemAISentenceParaphrasePC`
- `eval_langs` fixed
- Dataset path was fixed for two datasets
- Metadata was completed for all tasks, mainly the following fields: `date`, `task_subtypes`, `dialect`, `sample_creation`
- ruff lint
- Rename `nomic_bert_models.py` to `nomic_bert_model.py` and update it
Great! Can you submit your results to https://github.com/embeddings-benchmark/results and compare them with the results from the paper?
mteb/tasks/PairClassification/eng/PubChemAISentenceParaphrasePC.py
Hi, thank you for the review. I haven't worked with the mentioned repository before. Could you clarify how I should proceed? What do you mean by "paper"?
You should submit the results of your run to that repo (a directory of JSON files). By "paper" I mean that you should check whether your results match those in https://arxiv.org/abs/2412.00532v1
The detailed score for each task-model pair is not reported in our paper (I have it locally though); instead, an average per category is provided. Additionally, the current PR does not include the exact same tasks: three tasks have been removed, one new multilingual task has been added, and eight tasks have been merged into two. Therefore, the average scores may not align perfectly. However, I can run the benchmark and push the scores to the results repository.
- Text should be truncated for Amazon text embedding models.
- `text-embedding-ada-002` returns null embeddings for some inputs with 8192 tokens.
- Two datasets are updated, dropping very long samples (len > 99th percentile)
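The last item above (dropping samples longer than the 99th percentile) could look roughly like the sketch below. This is only an illustration, not the code used in the PR: the function name `drop_long_samples` is hypothetical, length is assumed to be measured in characters, and a nearest-rank percentile is assumed.

```python
import math


def drop_long_samples(texts, percentile=99.0):
    """Keep only texts whose character length is at or below the given percentile.

    Hypothetical helper: the PR does not show how the cutoff was computed.
    """
    if not texts:
        return []
    lengths = sorted(len(t) for t in texts)
    # Nearest-rank percentile: the smallest length such that at least
    # `percentile` percent of the samples are at or below it.
    rank = max(1, math.ceil(percentile / 100.0 * len(lengths)))
    cutoff = lengths[rank - 1]
    return [t for t in texts if len(t) <= cutoff]
```

Measuring length in tokens instead of characters would follow the same pattern, with `len(t)` swapped for a tokenizer call.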
I updated

Error example:

    ValidationException: An error occurred (ValidationException) when calling the InvokeModel operation:
    400 Bad Request: Too many input tokens. Max input tokens: 8192, request input token count: 11530

To handle long text, there are two approaches for truncation:

Static Approach

Dynamic Approach

    try:
        all_embeddings.append(self._embed_amazon(sentence))
    except ValidationError as e:
        pattern = "request input token count:" + r"\s*(\d+)"
        match = re.search(pattern, str(e))
        if match:
            num_tokens = int(match.group(1))
            ratio = 0.9 * (self._max_tokens / num_tokens)
            max_sequence_length = int(len(sentence) * ratio)
            all_embeddings.append(self._embed_amazon(sentence[:max_sequence_length]))
        else:
            raise e

I used the first method, but if you think the second one is better, let me know.
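For comparison with the dynamic snippet, a static truncation might be sketched as follows. The static code did not survive in this thread, so this is an assumption rather than the PR's implementation: the name `truncate_static`, the 4-characters-per-token estimate, and the 0.9 safety factor are all illustrative.

```python
def truncate_static(text, max_tokens=8192, chars_per_token=4.0, safety=0.9):
    """Cut text to a fixed character budget derived from the token limit.

    Assumption: an average of `chars_per_token` characters per token; the real
    ratio depends on the tokenizer and the text, so `safety` leaves headroom.
    """
    max_chars = int(max_tokens * chars_per_token * safety)
    return text[:max_chars]
```

The static variant never needs a failed request to learn the limit, but it can cut more than necessary or still overshoot on text that tokenizes densely (for example SMILES strings).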
While running the benchmark again for all tasks, I ran into an error while evaluating a classification task. Digging deeper, I noticed something weird. OpenAI text embedding models generally have a context length of 8192 (if you go over that, the API throws an error). But it looks like the

    from openai import OpenAI
    import tiktoken
    import numpy as np

    client = OpenAI()
    encoding = tiktoken.get_encoding("cl100k_base")

    sn = 'Hello World' * 5000
    print(f'Num tokens: {len(encoding.encode(sn))}')

    # Truncate to 8191 tokens, then decode back to text
    truncated_sentence = encoding.encode(sn)[:8191]
    truncated_sentence = encoding.decode(truncated_sentence)

    response = client.embeddings.create(
        input=truncated_sentence,
        model="text-embedding-ada-002",
        encoding_format="float",
    )
    em = np.array(response.data[0].embedding)
    print(f'Null values: {np.isnan(em.astype(np.float32)).sum()}')
Yes, we have some inconsistency with

For Bedrock token count, I suggest adding a static calculation for the number of tokens along with a dynamic approach to ensure the models work correctly. Additionally, could you create a table here comparing the results from the paper with your results?
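One way to read the suggestion above is a static character cap applied first, with the dynamic retry as a fallback when the provider still rejects the request. The sketch below is a guess at that shape, not the merged code: `TooManyTokensError` is a hypothetical stand-in for the provider's validation error, `embed_fn` stands in for a call such as `self._embed_amazon`, and the characters-per-token estimate is an assumption.

```python
import re


class TooManyTokensError(Exception):
    """Hypothetical stand-in for the provider's validation error."""


TOKEN_COUNT_PATTERN = re.compile(r"request input token count:\s*(\d+)")


def embed_with_fallback(text, embed_fn, max_tokens=8192, chars_per_token=4.0):
    """Apply a static character cap, then retry once using the reported token count."""
    # Static step: cap the input using an assumed characters-per-token ratio.
    candidate = text[: int(max_tokens * chars_per_token)]
    try:
        return embed_fn(candidate)
    except TooManyTokensError as err:
        # Dynamic step: read the real token count from the error message
        # and shrink the text proportionally, keeping a 10% margin.
        match = TOKEN_COUNT_PATTERN.search(str(err))
        if match is None:
            raise
        num_tokens = int(match.group(1))
        ratio = 0.9 * (max_tokens / num_tokens)
        return embed_fn(candidate[: int(len(candidate) * ratio)])
```

This retries only once; if the shortened text is still too long, the second error propagates to the caller.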
Great!
Originally posted by @HSILA in embeddings-benchmark/results#89 (comment)
As discussed in issue #1585, I have integrated ChemTEB into the MTEB codebase and submitted this PR for merging. I have added 28 tasks and two model families: `amazon_models.py` and `cohere_bedrock_models.py`. These models were included to meet our internal requirements, as they are not currently available in MTEB. I thought it might be good to have them, but I can remove them if necessary. I have tested the mentioned models with all the tasks; you can see the results here: chemteb_results_pr.csv

Closes #1585
Checklist
- `make test`
- `make lint`
- `docs/benchmarks.md` and `docs/tasks.md`
- `ChemTEB` as a benchmark in `benchmark.py`
Adding datasets checklist
- Reason for dataset addition: To address the lack of benchmarks tailored to chemical sciences, where text involves complex entity names, expressions, and specialized representations like SMILES codes, not typically found in general domain datasets.
- `mteb -m {model_name} -t {task_name}` command.
  - `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - `intfloat/multilingual-e5-small`
- `self.stratified_subsampling()` under `dataset_transform()`
- `make test`
- `make lint`

Adding a model checklist
- `mteb.get_model(model_name, revision)` and `mteb.get_model_meta(model_name, revision)`