[V1] Initial support of multimodal models for V1 re-arch #10699
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these: …
Model changes look good to me overall. We can consider passing the placeholder indices to `get_input_embeddings` so that the relationship between `input_ids` and `multimodal_embeddings` becomes more explicit.

We should add tests to ensure that these models actually work in V1.
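Purely as an illustration of the interface change suggested above, here is a minimal PyTorch sketch; all names and signatures below are hypothetical and are not the actual vLLM API.

```python
# Hypothetical sketch -- not the actual vLLM API. It only illustrates how
# passing placeholder indices into get_input_embeddings would make the
# mapping between input_ids and multimodal_embeddings explicit.
import torch
from torch import nn


class VLMSketch(nn.Module):

    def __init__(self, vocab_size: int = 1000, hidden_size: int = 16):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)

    def get_input_embeddings(
        self,
        input_ids: torch.Tensor,              # [num_tokens]
        multimodal_embeddings: torch.Tensor,  # [num_mm_tokens, hidden_size]
        placeholder_indices: torch.Tensor,    # [num_mm_tokens] positions in input_ids
    ) -> torch.Tensor:
        # Ordinary text embeddings first.
        inputs_embeds = self.embed_tokens(input_ids)
        # Overwrite placeholder positions with the vision encoder outputs;
        # the caller decides which positions are placeholders.
        inputs_embeds[placeholder_indices] = multimodal_embeddings
        return inputs_embeds


model = VLMSketch()
input_ids = torch.tensor([1, 2, 7, 7, 7, 3])   # 7 = image placeholder token
mm_embeds = torch.randn(3, 16)                 # 3 image-patch embeddings
out = model.get_input_embeddings(input_ids, mm_embeds, torch.tensor([2, 3, 4]))
print(out.shape)  # torch.Size([6, 16])
```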
Hmm, I don't think that alone will help, actually, because V1 assumes chunked prefill by default. Unless we're passing additional information such as the computed tokens per request, the model runner doesn't know the original indices of …
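To make the chunked-prefill concern concrete, here is a hypothetical sketch (the helper and argument names are invented for illustration, not taken from the vLLM codebase) of how chunk-local placeholder positions could be recovered from per-request bookkeeping such as the number of already-computed tokens.

```python
# Illustrative sketch only. With chunked prefill, each step processes a slice
# of the prompt, so per-request bookkeeping (e.g. num_computed_tokens) is
# needed to recover which placeholder positions fall into the current chunk.
import torch


def placeholder_indices_in_chunk(
    placeholder_positions: torch.Tensor,  # positions of image tokens in the full prompt
    num_computed_tokens: int,             # tokens already processed in earlier chunks
    num_scheduled_tokens: int,            # tokens scheduled for this chunk
) -> torch.Tensor:
    """Return placeholder positions relative to the current chunk."""
    chunk_start = num_computed_tokens
    chunk_end = num_computed_tokens + num_scheduled_tokens
    mask = (placeholder_positions >= chunk_start) & (placeholder_positions < chunk_end)
    # Shift from prompt-wide indices to chunk-local indices.
    return placeholder_positions[mask] - chunk_start


# Example: image placeholders occupy prompt positions 4..9, and the current
# chunk covers prompt positions 6..13.
positions = torch.arange(4, 10)
print(placeholder_indices_in_chunk(positions,
                                   num_computed_tokens=6,
                                   num_scheduled_tokens=8))
# tensor([0, 1, 2, 3])  -> prompt positions 6..9 land at chunk offsets 0..3
```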
LGTM! It'd be nice if @DarkLight1337 can also take another look.
This pull request has merge conflicts that must be resolved before it can be merged.
It seems that we have tests related to using image embeddings as input, indicating usage of this feature by the community (which will be a breaking change in V1, since we cannot take batched image embeddings directly as input), so I'm separating Qwen2-VL out to unblock this PR.
Sorry for the late reply, it looks good to me.
This PR adds V1 support for selected image language models:
- ~~Qwen2-VL (missing MRope implementation, will address in a separate PR)~~ separated out to [V1][VLM] Add V1-rearch image inference support for Qwen2-VL #10988