[Model] Support Qwen2 embeddings and use tags to select model tests #10184
Conversation
pooler_config,
pooling_type=PoolingType.LAST,
normalize=True,
softmax=False)
Will this be able to be controlled by the pooling args we spoke about offline?
Yes - these are the model's default values which can be overridden.
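For reference, a minimal sketch of how those defaults might be overridden at load time. This assumes the override_pooler_config argument and PoolerConfig class from later vLLM releases, so the exact names may differ in the version under review here:

# Sketch only: override the model's default pooling settings when constructing the LLM.
# `override_pooler_config` / `PoolerConfig` are assumptions about the vLLM API, not taken from this PR.
from vllm import LLM
from vllm.config import PoolerConfig

llm = LLM(
    model="Alibaba-NLP/gte-Qwen2-1.5B-instruct",
    task="embedding",
    override_pooler_config=PoolerConfig(pooling_type="LAST", normalize=True, softmax=False),
)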
Nevermind, I was misled by the warning message. It's intended that
I have fixed the tests.
This pull request has merge conflicts that must be resolved before it can be merged.
Any update on this?
I have updated the tests so that Qwen2 embedding models are only tested on nightly. I have already tested them locally and confirmed that they pass.
Is there any example usage of the Qwen2 embeddings? The embeddings from HF and vLLM do not match when following the official usage doc.
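For what it's worth, a minimal sketch of offline embedding usage with vLLM (assuming the LLM.encode() API and the "embedding" task name; this is not an official example from this PR):

# Sketch only: compute an embedding with vLLM's offline API for the model discussed here.
from vllm import LLM

llm = LLM(model="Alibaba-NLP/gte-Qwen2-1.5B-instruct", task="embedding")
outputs = llm.encode(["What is the capital of China?"])
embedding = outputs[0].outputs.embedding  # list of floats, one per hidden dimension
print(len(embedding))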
After some debugging, I found the problem - you need to set
Hmm, this is really odd. I found that the EOS/pad token fails to be added to the prompt when
>>> from transformers import AutoTokenizer
>>> AutoTokenizer.from_pretrained("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=False)
Qwen2TokenizerFast(name_or_path='Alibaba-NLP/gte-Qwen2-1.5B-instruct', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|endoftext|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}, clean_up_tokenization_spaces=False), added_tokens_decoder={
151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
>>> AutoTokenizer.from_pretrained("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)
Qwen2TokenizerFast(name_or_path='Alibaba-NLP/gte-Qwen2-1.5B-instruct', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token': '<|endoftext|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}, clean_up_tokenization_spaces=False), added_tokens_decoder={
151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
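A quick way to check the above (my own snippet, not from the PR) is to encode a prompt under both settings and see whether the EOS/pad token id 151643 ends up appended:

# Compare tokenization with and without remote code to see if EOS ends up on the prompt.
from transformers import AutoTokenizer

for trc in (False, True):
    tok = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-Qwen2-1.5B-instruct",
                                        trust_remote_code=trc)
    ids = tok("what is the capital of China?")["input_ids"]
    print(f"trust_remote_code={trc}: last token id = {ids[-1]}, "
          f"EOS appended? {ids[-1] == tok.eos_token_id}")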
Thanks for your excellent work! Does this MR support the bidirectional attention mechanism in gte-Qwen2? I noticed that even after manually adding EOS to the input prompt, vLLM produces different embeddings than directly loading the original gte model.
I got the test script by @Ecocytus to pass simply by adding
I replaced the model_name in @Ecocytus's script with my own path downloaded from Hugging Face, and I found the
From my understanding, vLLM does use a bidirectional attention mask (please correct me if I'm wrong @mgoin).
Gotcha! And I wonder whether bidirectional attention is enabled when loading the gte-Qwen2-7B model, as the author mentioned
I just checked,
I just found that
I am a complete vLLM novice. Can I modify a few lines of code to force the loaded model to be encoder-only? I want to confirm whether the attention mask causes this embedding diff.
Actually, I am a bit suspicious about whether the attention mask is the real issue, since the 1.5B model is also supposed to use a bidirectional attention mask yet works correctly with our decoder attention mask.
You can try this patch:

diff --git a/tests/models/embedding/language/test_embedding.py b/tests/models/embedding/language/test_embedding.py
index c3f351ef..25cdfc81 100644
--- a/tests/models/embedding/language/test_embedding.py
+++ b/tests/models/embedding/language/test_embedding.py
@@ -19,8 +19,9 @@ from ..utils import check_embeddings_close
marks=[pytest.mark.core_model, pytest.mark.cpu_model]),
pytest.param("BAAI/bge-multilingual-gemma2",
marks=[pytest.mark.core_model]),
- pytest.param("ssmits/Qwen2-7B-Instruct-embed-base"),
- pytest.param("Alibaba-NLP/gte-Qwen2-1.5B-instruct"),
+ # pytest.param("ssmits/Qwen2-7B-Instruct-embed-base"),
+ # pytest.param("Alibaba-NLP/gte-Qwen2-1.5B-instruct"),
+ pytest.param("Alibaba-NLP/gte-Qwen2-7B-instruct"),
],
)
@pytest.mark.parametrize("dtype", ["half"])
diff --git a/vllm/model_executor/models/qwen2.py b/vllm/model_executor/models/qwen2.py
index 370cff5f..844e93a6 100644
--- a/vllm/model_executor/models/qwen2.py
+++ b/vllm/model_executor/models/qwen2.py
@@ -27,7 +27,7 @@ import torch
from torch import nn
from transformers import Qwen2Config
-from vllm.attention import Attention, AttentionMetadata
+from vllm.attention import Attention, AttentionMetadata, AttentionType
from vllm.compilation.decorators import support_torch_compile
from vllm.config import CacheConfig, VllmConfig
from vllm.distributed import get_pp_group, get_tensor_model_parallel_world_size
@@ -164,11 +164,13 @@ class Qwen2Attention(nn.Module):
hidden_states: torch.Tensor,
kv_cache: torch.Tensor,
attn_metadata: AttentionMetadata,
+ attn_type: str = AttentionType.DECODER,
) -> torch.Tensor:
qkv, _ = self.qkv_proj(hidden_states)
q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
q, k = self.rotary_emb(positions, q, k)
- attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
+ attn_output = self.attn(q, k, v, kv_cache, attn_metadata,
+ attn_type=attn_type)
output, _ = self.o_proj(attn_output)
return output
@@ -216,7 +218,8 @@ class Qwen2DecoderLayer(nn.Module):
hidden_states: torch.Tensor,
kv_cache: torch.Tensor,
attn_metadata: AttentionMetadata,
- residual: Optional[torch.Tensor],
+ attn_type: str = AttentionType.DECODER,
+ residual: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
# Self Attention
if residual is None:
@@ -230,6 +233,7 @@ class Qwen2DecoderLayer(nn.Module):
hidden_states=hidden_states,
kv_cache=kv_cache,
attn_metadata=attn_metadata,
+ attn_type=attn_type,
)
# Fully Connected
@@ -292,6 +296,12 @@ class Qwen2Model(nn.Module):
self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
else:
self.norm = PPMissingLayer()
+
+ self._attn_type = {
+ "generate": AttentionType.DECODER,
+ "embedding": AttentionType.ENCODER_ONLY,
+ "draft": AttentionType.DECODER,
+ }[vllm_config.model_config.task]
def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
return self.embed_tokens(input_ids)
@@ -322,7 +332,8 @@ class Qwen2Model(nn.Module):
hidden_states,
kv_caches[i - self.start_layer],
attn_metadata,
- residual,
+ attn_type=self._attn_type,
+ residual=residual,
)
if not get_pp_group().is_last_rank:
return IntermediateTensors({

and then run the unit test.
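For reference, one way to run just the patched test file (the path is taken from the diff above; the -k filter is my own assumption to select the gte model case):

# Programmatic equivalent of running pytest on the patched embedding test file.
import pytest

pytest.main(["tests/models/embedding/language/test_embedding.py", "-v", "-k", "gte"])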
For convenience, I didn't run pytest, but I changed the
Ok, so the 1.5B model wasn't trained using a bidirectional mask as advertised, only the 7B model 🤔 In any case, @mgoin, how should we make the attention method configurable?
Also cc @youkaichao. Do you think adding this to model config is acceptable?
We need to support all of these in
The problem now is the attention mask, which can be different even within the same task, so we can hardly set it automatically.
My opinion: if it only occurs for the Qwen model, we can have an env var for it. If we find more models need it, we can add it to the CLI args.
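A minimal sketch of what that env-var gate could look like (the variable name VLLM_QWEN2_ENCODER_ONLY is made up for illustration, not an existing vLLM setting):

# Sketch only: choose the attention type for the embedding task based on an env var.
import os

from vllm.attention import AttentionType


def resolve_attn_type(task: str) -> str:
    """Return ENCODER_ONLY for embedding runs when the (hypothetical) env var is set."""
    if task == "embedding" and os.getenv("VLLM_QWEN2_ENCODER_ONLY", "0") == "1":
        return AttentionType.ENCODER_ONLY
    return AttentionType.DECODER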
Ok, I think I found a better way. We can use
This works for me. Thanks a lot!
Using
If I load the model Alibaba-NLP/gte-Qwen2-1.5B-instruct locally, can I set the configuration trust_remote_code=True?
Yes.
Hello! I am a beginner in using LLMs, and I have a question. If I want to obtain the output of the last hidden layer of the Qwen2-1.5B-instruct model as an embedding, can I use
Yes, but you have to set
A newer version of #5611 and #6282, since the source repo has been archived.
FIX #5600
FIX #5827
FIX #6015
FIX #9761