
[Model] Implement merged input processor for LLaVA model #10676

Merged: 24 commits into vllm-project:main from llava-mm-processor on Dec 7, 2024

Conversation

@DarkLight1337 (Member) commented Nov 26, 2024

Part of #10114

This PR completes the basic support for the merged input processor. In particular:

  • Added a method to MultiModalProcessor for generating dummy data for profiling. The default implementation uses the placeholder tokens defined in its metadata.
  • Updated PlaceholderMap and the V1 MMInputMapper to handle the outputs of MultiModalProcessor.
  • Added the MultiModalRegistry.register_processor_by_metadata convenience function.

With these changes, the merged input processor can now be used for LLaVA model. Other models will be updated in subsequent PRs.
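For readers following along, here is a minimal, self-contained sketch of the design being described. Every name in it is illustrative rather than vLLM's actual API: a merged processor hands back the processed tensors together with per-modality placeholder ranges, and the default profiling path builds dummy data directly from that placeholder metadata.

```python
from dataclasses import dataclass, field


@dataclass
class PlaceholderRange:
    """A run of placeholder tokens standing in for one multimodal item."""
    offset: int   # index of the first placeholder token in the prompt
    length: int   # number of placeholder tokens


@dataclass
class ProcessorOutput:
    """What a merged processor returns: tokens, tensors, placeholders."""
    prompt_token_ids: list
    mm_kwargs: dict                  # processed tensors; keys vary per model
    mm_placeholders: dict = field(default_factory=dict)


def dummy_data_for_profiling(seq_len, placeholder_token_id,
                             num_placeholder_tokens):
    """Default dummy-data path: repeat the placeholder token defined in the
    processor's metadata, then pad out the rest of the sequence."""
    tokens = [placeholder_token_id] * num_placeholder_tokens
    tokens += [0] * (seq_len - num_placeholder_tokens)
    return ProcessorOutput(
        prompt_token_ids=tokens,
        mm_kwargs={},
        mm_placeholders={"image": [PlaceholderRange(0, num_placeholder_tokens)]},
    )


class MergedProcessorRegistry:
    """Toy stand-in for a registry mapping model classes to processors."""

    def __init__(self):
        self._processors = {}

    def register_processor(self, model_cls, processor_factory):
        self._processors[model_cls] = processor_factory

    def has_processor(self, model_cls):
        return model_cls in self._processors
```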

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge

🚀

@ywang96 (Member) commented Nov 26, 2024

> This is no longer the case for the merged input processor because the processed tensors are output without modality information.

@DarkLight1337 What does mm_data look like for the merged input processor?

@DarkLight1337 (Member, Author) commented Nov 26, 2024
Inside SequenceData:

multi_modal_data={'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]]), 'pixel_values': tensor([[[[-1.1207, -1.1207, -1.1353,  ...,  0.0909,  0.1347,  0.1785],
          [-1.1353, -1.1353, -1.1353,  ...,  0.2953,  0.2953,  0.2953],
          [-1.1499, -1.1499, -1.1353,  ...,  0.3829,  0.3391,  0.2953],
          ...,
          [ 0.6895,  1.2880,  1.6530,  ...,  1.1128,  0.9814,  1.0106],
          [ 0.6895,  1.2734,  1.6238,  ...,  0.9522,  1.0690,  1.0982],
          [ 0.6895,  1.3026,  1.6238,  ...,  1.0252,  1.0836,  1.2004]],

         [[-1.5720, -1.5720, -1.5870,  ..., -0.2663, -0.1913, -0.1313],
          [-1.5870, -1.5870, -1.5870,  ..., -0.1913, -0.1613, -0.1463],
          [-1.6020, -1.6020, -1.5870,  ..., -0.1163, -0.1313, -0.1313],
          ...,
          [ 0.9643,  1.5946,  1.9698,  ...,  1.2645,  1.1294,  1.1594],
          [ 0.9643,  1.5946,  1.9398,  ...,  1.0994,  1.2194,  1.2495],
          [ 0.9643,  1.6096,  1.9398,  ...,  1.1744,  1.2344,  1.3545]],

         [[-1.4376, -1.4376, -1.4518,  ..., -0.3000, -0.2857, -0.2289],
          [-1.4518, -1.4518, -1.4518,  ..., -0.4279, -0.4564, -0.4279],
          [-1.4660, -1.4660, -1.4518,  ..., -0.4422, -0.5275, -0.5417],
          ...,
          [ 0.8945,  1.3496,  1.7193,  ...,  1.1932,  1.0652,  1.0936],
          [ 0.8945,  1.3496,  1.6909,  ...,  1.0367,  1.1505,  1.1789],
          [ 0.8945,  1.3638,  1.6909,  ...,  1.1078,  1.1647,  1.2785]]]])}

multi_modal_placeholders={'image': [{'offset': 5, 'length': 576}]}
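As a hedged illustration (the helper below is hypothetical, not part of vLLM): the placeholder map is what lets the runner locate the image's token positions without knowing which tensor keys the modality produced.

```python
def placeholder_positions(mm_placeholders):
    """Expand {'image': [{'offset': 5, 'length': 576}]} into token indices."""
    out = {}
    for modality, ranges in mm_placeholders.items():
        positions = []
        for r in ranges:
            positions.extend(range(r["offset"], r["offset"] + r["length"]))
        out[modality] = positions
    return out


# For the dump above, tokens 5 through 580 are the image placeholders.
positions = placeholder_positions({"image": [{"offset": 5, "length": 576}]})
assert positions["image"][0] == 5 and positions["image"][-1] == 580
```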

@DarkLight1337 (Member, Author) commented Nov 26, 2024

Other models may have additional keys associated with the image modality, so we can't really hardcode this.

@ywang96 (Member) commented Nov 26, 2024

> Other models may have additional keys associated with the image modality, so we can't really hardcode this.

I see where the problem is. Can you see if this model works on V1?
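To make the hardcoding problem concrete, here is a rough sketch. The key names for the non-LLaVA models below are drawn from their HF processors and should be treated as examples, not guarantees:

```python
# LLaVA: image tensors arrive under 'pixel_values' (as in the dump above).
llava_mm_kwargs = {"pixel_values": ..., "attention_mask": ...}

# Other models use different or additional keys for the same "image"
# modality (example key names, quoted from memory):
fuyu_mm_kwargs = {"image_patches": ...}
qwen2_vl_mm_kwargs = {"pixel_values": ..., "image_grid_thw": ...}

# So the pipeline cannot simply do
#     pixel_values = mm_kwargs["pixel_values"]   # KeyError for Fuyu
# and must instead rely on the per-modality placeholder metadata, passing
# mm_kwargs through to the model opaquely.
```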

@DarkLight1337 (Member, Author) commented Nov 26, 2024

It still fails because of the hardcoded "image" access inside MMInputMapper. I think I need to update this code to skip the input mapper if a merged processor is found for the model.
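Roughly, the fix amounts to a guard in the input pipeline; a sketch with hypothetical names, reusing the toy registry from the earlier snippet:

```python
def legacy_input_mapper(image_data):
    # Stand-in for the old per-modality mapper with its hardcoded access.
    raise NotImplementedError


def map_inputs(model_cls, mm_data, registry):
    if registry.has_processor(model_cls):
        # A merged processor already produced fully processed tensors plus
        # placeholder metadata, so the legacy mapper must be skipped.
        return mm_data
    return legacy_input_mapper(mm_data["image"])  # old hardcoded path
```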

@DarkLight1337 (Member, Author) commented Nov 26, 2024

OK, the test now passes for V1. Embedding inputs don't work, though (with or without V1); let me fix that. There also seems to be an issue with logprobs failing to be output.

@DarkLight1337 (Member, Author) commented Nov 27, 2024

I have updated PlaceholderMap.from_seq_group to work with the merged processor for V0 as well. PTAL.
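For context, a hedged sketch of what that update entails (hypothetical code, not the actual PlaceholderMap): given the placeholder ranges stored on the sequence and the token window currently being computed, keep only the overlapping portion of each range.

```python
def intersect_range(offset, length, window_start, window_end):
    """Clip one placeholder range to the token window [window_start, window_end)."""
    start = max(offset, window_start)
    end = min(offset + length, window_end)
    if start >= end:
        return None  # no overlap with this scheduling window
    return {"offset": start, "length": end - start}


# Image placeholders at tokens 5..580, computing tokens 0..512 this step:
print(intersect_range(5, 576, 0, 512))   # {'offset': 5, 'length': 507}
```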

@DarkLight1337 changed the title from "[3/N] Implement merged input processor for LLaVA model" to "[3/N] Support merged input processor for LLaVA model" Nov 27, 2024
@DarkLight1337 changed the title from "[3/N] Support merged input processor for LLaVA model" to "[3/N] Support and implement merged input processor for LLaVA model" Nov 27, 2024
@DarkLight1337 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Nov 28, 2024
@ywang96 (Member) left a review:

This PR actually seems to break Pixtral HF (I cannot run the example file on V0), so I'm blocking it for now until we fix it.

@ywang96 (Member) left a review:

LGTM!

@ywang96 merged commit 955fa95 into vllm-project:main Dec 7, 2024
51 checks passed
@DarkLight1337 deleted the llava-mm-processor branch December 7, 2024 09:33
@DarkLight1337 changed the title from "[3/N] Support and implement merged input processor for LLaVA model" to "Support and implement merged input processor for LLaVA model" Dec 7, 2024
@DarkLight1337 changed the title from "Support and implement merged input processor for LLaVA model" to "[VLM] Support and implement merged input processor for LLaVA model" Dec 7, 2024
@DarkLight1337 changed the title from "[VLM] Support and implement merged input processor for LLaVA model" to "[Model] Support and implement merged input processor for LLaVA model" Dec 7, 2024
@DarkLight1337 changed the title from "[Model] Support and implement merged input processor for LLaVA model" to "[Model] Implement merged input processor for LLaVA model" Dec 7, 2024
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024