
[Model] Implement merged input processor for LLaVA model #10676

Merged: 24 commits into vllm-project:main from llava-mm-processor on Dec 7, 2024

Conversation

@DarkLight1337 (Member) commented Nov 26, 2024

Part of #10114

This PR completes the basic support for the merged input processor. In particular:

  • Added a method to MultiModalProcessor for generating dummy data for profiling. The default implementation uses the placeholder tokens defined in its metadata.
  • Updated PlaceholderMap and the V1 MMInputMapper to handle the outputs of MultiModalProcessor.
  • Added the MultiModalRegistry.register_processor_by_metadata convenience function.

With these changes, the merged input processor can now be used for LLaVA model. Other models will be updated in subsequent PRs.
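For readers following along, here is a minimal, self-contained sketch of the design being described. Every name in it is illustrative rather than vLLM's actual API: a merged processor hands back the processed tensors together with per-modality placeholder ranges, and the default profiling path builds dummy data directly from that placeholder metadata.

```python
from dataclasses import dataclass, field


@dataclass
class PlaceholderRange:
    """A run of placeholder tokens standing in for one multimodal item."""
    offset: int   # index of the first placeholder token in the prompt
    length: int   # number of placeholder tokens


@dataclass
class ProcessorOutput:
    """What a merged processor returns: tokens, tensors, placeholders."""
    prompt_token_ids: list
    mm_kwargs: dict                  # processed tensors; keys vary per model
    mm_placeholders: dict = field(default_factory=dict)


def dummy_data_for_profiling(seq_len, placeholder_token_id,
                             num_placeholder_tokens):
    """Default dummy-data path: repeat the placeholder token defined in the
    processor's metadata, then pad out the rest of the sequence."""
    tokens = [placeholder_token_id] * num_placeholder_tokens
    tokens += [0] * (seq_len - num_placeholder_tokens)
    return ProcessorOutput(
        prompt_token_ids=tokens,
        mm_kwargs={},
        mm_placeholders={"image": [PlaceholderRange(0, num_placeholder_tokens)]},
    )


class MergedProcessorRegistry:
    """Toy stand-in for a registry mapping model classes to processors."""

    def __init__(self):
        self._processors = {}

    def register_processor(self, model_cls, processor_factory):
        self._processors[model_cls] = processor_factory

    def has_processor(self, model_cls):
        return model_cls in self._processors
```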

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge

🚀

@ywang96 (Member) commented Nov 26, 2024

> This is no longer the case for the merged input processor because the processed tensors are output without modality information.

@DarkLight1337 What does mm_data look like for the merged input processor?

@DarkLight1337 (Member, Author) commented Nov 26, 2024
Inside SequenceData:

multi_modal_data={'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]]), 'pixel_values': tensor([[[[-1.1207, -1.1207, -1.1353,  ...,  0.0909,  0.1347,  0.1785],
          [-1.1353, -1.1353, -1.1353,  ...,  0.2953,  0.2953,  0.2953],
          [-1.1499, -1.1499, -1.1353,  ...,  0.3829,  0.3391,  0.2953],
          ...,
          [ 0.6895,  1.2880,  1.6530,  ...,  1.1128,  0.9814,  1.0106],
          [ 0.6895,  1.2734,  1.6238,  ...,  0.9522,  1.0690,  1.0982],
          [ 0.6895,  1.3026,  1.6238,  ...,  1.0252,  1.0836,  1.2004]],

         [[-1.5720, -1.5720, -1.5870,  ..., -0.2663, -0.1913, -0.1313],
          [-1.5870, -1.5870, -1.5870,  ..., -0.1913, -0.1613, -0.1463],
          [-1.6020, -1.6020, -1.5870,  ..., -0.1163, -0.1313, -0.1313],
          ...,
          [ 0.9643,  1.5946,  1.9698,  ...,  1.2645,  1.1294,  1.1594],
          [ 0.9643,  1.5946,  1.9398,  ...,  1.0994,  1.2194,  1.2495],
          [ 0.9643,  1.6096,  1.9398,  ...,  1.1744,  1.2344,  1.3545]],

         [[-1.4376, -1.4376, -1.4518,  ..., -0.3000, -0.2857, -0.2289],
          [-1.4518, -1.4518, -1.4518,  ..., -0.4279, -0.4564, -0.4279],
          [-1.4660, -1.4660, -1.4518,  ..., -0.4422, -0.5275, -0.5417],
          ...,
          [ 0.8945,  1.3496,  1.7193,  ...,  1.1932,  1.0652,  1.0936],
          [ 0.8945,  1.3496,  1.6909,  ...,  1.0367,  1.1505,  1.1789],
          [ 0.8945,  1.3638,  1.6909,  ...,  1.1078,  1.1647,  1.2785]]]])}

multi_modal_placeholders={'image': [{'offset': 5, 'length': 576}]}
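As a hedged illustration (the helper below is hypothetical, not part of vLLM): the placeholder map is what lets the runner locate the image's token positions without knowing which tensor keys the modality produced.

```python
def placeholder_positions(mm_placeholders):
    """Expand {'image': [{'offset': 5, 'length': 576}]} into token indices."""
    out = {}
    for modality, ranges in mm_placeholders.items():
        positions = []
        for r in ranges:
            positions.extend(range(r["offset"], r["offset"] + r["length"]))
        out[modality] = positions
    return out


# For the dump above, tokens 5 through 580 are the image placeholders.
positions = placeholder_positions({"image": [{"offset": 5, "length": 576}]})
assert positions["image"][0] == 5 and positions["image"][-1] == 580
```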

@DarkLight1337 (Member, Author) commented Nov 26, 2024

Other models may have additional keys associated with the image modality, so we can't really hardcode this.

@ywang96 (Member) commented Nov 26, 2024

> Other models may have additional keys associated with the image modality, so we can't really hardcode this.

I see where the problem is. Can you see if this model works on V1?
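To make the hardcoding problem concrete, here is a rough sketch. The key names for the non-LLaVA models below are drawn from their HF processors and should be treated as examples, not guarantees:

```python
# LLaVA: image tensors arrive under 'pixel_values' (as in the dump above).
llava_mm_kwargs = {"pixel_values": ..., "attention_mask": ...}

# Other models use different or additional keys for the same "image"
# modality (example key names, quoted from memory):
fuyu_mm_kwargs = {"image_patches": ...}
qwen2_vl_mm_kwargs = {"pixel_values": ..., "image_grid_thw": ...}

# So the pipeline cannot simply do
#     pixel_values = mm_kwargs["pixel_values"]   # KeyError for Fuyu
# and must instead rely on the per-modality placeholder metadata, passing
# mm_kwargs through to the model opaquely.
```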

@DarkLight1337 (Member, Author) commented Nov 26, 2024

It still fails because of the hardcoded "image" access inside MMInputMapper. I think I need to update this code to skip the input mapper if a merged processor is found for the model.
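Roughly, the fix amounts to a guard in the input pipeline; a sketch with hypothetical names, reusing the toy registry from the earlier snippet:

```python
def legacy_input_mapper(image_data):
    # Stand-in for the old per-modality mapper with its hardcoded access.
    raise NotImplementedError


def map_inputs(model_cls, mm_data, registry):
    if registry.has_processor(model_cls):
        # A merged processor already produced fully processed tensors plus
        # placeholder metadata, so the legacy mapper must be skipped.
        return mm_data
    return legacy_input_mapper(mm_data["image"])  # old hardcoded path
```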

@DarkLight1337 (Member, Author) commented Nov 26, 2024

OK, the test now passes for V1. Embedding inputs don't work, though (with or without V1); let me fix that. There also seems to be an issue with logprobs failing to be output.

@DarkLight1337 (Member, Author) commented Nov 27, 2024

I have updated PlaceholderMap.from_seq_group to work with the merged processor for V0 as well. PTAL.
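For context, a hedged sketch of what that update entails (hypothetical code, not the actual PlaceholderMap): given the placeholder ranges stored on the sequence and the token window currently being computed, keep only the overlapping portion of each range.

```python
def intersect_range(offset, length, window_start, window_end):
    """Clip one placeholder range to the token window [window_start, window_end)."""
    start = max(offset, window_start)
    end = min(offset + length, window_end)
    if start >= end:
        return None  # no overlap with this scheduling window
    return {"offset": start, "length": end - start}


# Image placeholders at tokens 5..580, computing tokens 0..512 this step:
print(intersect_range(5, 576, 0, 512))   # {'offset': 5, 'length': 507}
```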

@DarkLight1337 changed the title from "[3/N] Implement merged input processor for LLaVA model" to "[3/N] Support merged input processor for LLaVA model" Nov 27, 2024
@DarkLight1337 changed the title from "[3/N] Support merged input processor for LLaVA model" to "[3/N] Support and implement merged input processor for LLaVA model" Nov 27, 2024
@DarkLight1337 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Nov 28, 2024
@ywang96 (Member) left a review:

This PR actually seems to break Pixtral HF (I cannot run the example file on V0), so I'm blocking it for now until we fix it.

@ywang96 (Member) left a review:

LGTM!

@ywang96 merged commit 955fa95 into vllm-project:main Dec 7, 2024
51 checks passed
@DarkLight1337 deleted the llava-mm-processor branch December 7, 2024 09:33
@DarkLight1337 changed the title from "[3/N] Support and implement merged input processor for LLaVA model" to "Support and implement merged input processor for LLaVA model" Dec 7, 2024
@DarkLight1337 changed the title from "Support and implement merged input processor for LLaVA model" to "[VLM] Support and implement merged input processor for LLaVA model" Dec 7, 2024
@DarkLight1337 changed the title from "[VLM] Support and implement merged input processor for LLaVA model" to "[Model] Support and implement merged input processor for LLaVA model" Dec 7, 2024
@DarkLight1337 changed the title from "[Model] Support and implement merged input processor for LLaVA model" to "[Model] Implement merged input processor for LLaVA model" Dec 7, 2024
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024