[Model] Add Support for Ovis1.6-Gemma2-9B Model #11240
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This model implementation is coupling the image processing and the model forwarding... You can refer to the model implementations in llava.py and phi3v.py when adding the model implementation.
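As a minimal sketch of that split, assuming hypothetical names (DecoupledOvisSketch, IMAGE_TOKEN_ID, merge_placeholder_embeddings) rather than the actual vLLM interfaces: the model receives already-preprocessed, batched pixel_values and only embeds and merges them, with no PIL handling inside forward.

import torch
import torch.nn as nn

IMAGE_TOKEN_ID = 32000  # hypothetical placeholder id, for illustration only


def merge_placeholder_embeddings(input_ids: torch.Tensor,
                                 inputs_embeds: torch.Tensor,
                                 image_embeds: torch.Tensor,
                                 image_token_id: int) -> torch.Tensor:
    """Scatter image embeddings into the placeholder-token positions.

    The number of placeholder tokens in input_ids must equal the number of
    rows in image_embeds.
    """
    mask = input_ids == image_token_id
    inputs_embeds = inputs_embeds.clone()
    inputs_embeds[mask] = image_embeds.to(inputs_embeds.dtype)
    return inputs_embeds


class DecoupledOvisSketch(nn.Module):
    """Illustrative only: vision encoding stays separate from preprocessing."""

    def __init__(self, visual_tokenizer: nn.Module, embed_tokens: nn.Embedding,
                 language_model: nn.Module):
        super().__init__()
        self.visual_tokenizer = visual_tokenizer   # tensor in, tensor out
        self.embed_tokens = embed_tokens
        self.language_model = language_model

    def forward(self, input_ids: torch.Tensor,
                pixel_values: torch.Tensor) -> torch.Tensor:
        # pixel_values arrive already preprocessed and batched by the
        # multi-modal processor; the model never sees PIL images.
        inputs_embeds = self.embed_tokens(input_ids)
        image_embeds = self.visual_tokenizer(pixel_values)   # (n, p, d)
        inputs_embeds = merge_placeholder_embeddings(
            input_ids, inputs_embeds, image_embeds.flatten(0, 1), IMAGE_TOKEN_ID)
        return self.language_model(inputs_embeds)

This mirrors how llava.py and phi3v.py keep preprocessing in the input pipeline and leave only embedding and merging to the model.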
    def reset_parameters(self, mean=0., std=1.) -> None:
        init.normal_(self.weight, mean=mean, std=std)
        self._fill_padding_idx_with_zero()
I think this is unnecessary since we always load weights from checkpoint.
@MULTIMODAL_REGISTRY.register_image_input_mapper()
@MULTIMODAL_REGISTRY.register_max_image_tokens(get_max_ovis_image_tokens)
The old input_mapper and input_processor pipeline will be deprecated soon. You will need to implement OvisMultimodalProcessor to handle multimodal data processing instead. You can refer to LlavaMultimodalProcessor and Phi3VMultimodalProcessor, because this model looks like the llava-style ones.
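For orientation only, here is a rough concept sketch of where that processing could live. The class and method names below are hypothetical and do not follow the actual vLLM base-class hooks; copy the real hook names from LlavaMultimodalProcessor and Phi3VMultimodalProcessor. The point is that PIL handling and placeholder expansion happen at request-processing time, so the model only ever sees token ids plus batched pixel_values.

from dataclasses import dataclass
from typing import List

import torch
from PIL import Image


@dataclass
class ProcessedInputs:
    """Everything the model needs: expanded token ids and batched pixels."""
    prompt_token_ids: List[int]
    pixel_values: torch.Tensor


class OvisMultimodalProcessorSketch:
    """Hypothetical stand-in for the real vLLM processor hooks."""

    def __init__(self, tokenizer, image_processor, image_token_id: int,
                 tokens_per_image: int) -> None:
        self.tokenizer = tokenizer                  # HF text tokenizer
        self.image_processor = image_processor      # HF image processor
        self.image_token_id = image_token_id
        self.tokens_per_image = tokens_per_image

    def __call__(self, prompt: str,
                 images: List[Image.Image]) -> ProcessedInputs:
        token_ids = self.tokenizer.encode(prompt)
        # Expand each image token into the full per-image placeholder run so
        # the scheduler can account for image tokens up front.
        expanded: List[int] = []
        for tok in token_ids:
            if tok == self.image_token_id:
                expanded.extend([self.image_token_id] * self.tokens_per_image)
            else:
                expanded.append(tok)
        # All PIL work (resizing, partitioning, normalization) happens here,
        # outside the model.
        pixel_values = self.image_processor(
            images, return_tensors="pt")["pixel_values"]
        return ProcessedInputs(expanded, pixel_values)

With this split, the merge_multimodal logic in the model can shrink to a simple embedding scatter over the placeholder positions.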
        image_placeholders.append(IMAGE_INDICATOR_IDS[4])
        return image_placeholders

    def preprocess_image(self, image: PIL.Image.Image, max_partition=9, covering_threshold=0.9, convert_to_rgb=True):
I think this should be implemented in the multi-modal processor instead of the vision tokenizer.
    max_image_tokens = min(max_tokens_per_grid, usable_vocab_size)
    return max_image_tokens


class SiglipVisualTokenizer(nn.Module):
Code for multi-modal data processing should be decoupled from the VisualTokenizer.
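As a hedged sketch only (the module names and the softmax head below follow the shape of the PR's code, not an exact API), a decoupled visual tokenizer reduces to a tensor-in/tensor-out module:

import torch
import torch.nn as nn


class SiglipVisualTokenizerSketch(nn.Module):
    """Vision encoding only: no PIL, no resizing, no placeholder bookkeeping."""

    def __init__(self, backbone: nn.Module, hidden_size: int,
                 visual_vocab_size: int) -> None:
        super().__init__()
        self.backbone = backbone                  # e.g. a SigLIP vision tower
        self.head = nn.Linear(hidden_size, visual_vocab_size)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # pixel_values: (num_images, channels, height, width), already
        # batched and normalized by the multi-modal processor.
        features = self.backbone(pixel_values)    # (num_images, patches, hidden)
        logits = self.head(features)
        return torch.softmax(logits, dim=-1)      # soft visual-token probabilities

preprocess_image, mock_input, and the placeholder bookkeeping would move into the multi-modal processor sketched above.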
        input_ids, inputs_embeds, _, _ = self.merge_multimodal(
            text_input_ids=inputs,
            text_attention_masks=kwargs.pop('attention_mask'),
            text_labels=None,
            pixel_values=kwargs.pop('pixel_values'),
            left_padding=True
        )
Please keep the forward function aligned with the other LMMs, because we will call the encoder in a separate runner for V1 in the future.
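As a hedged outline (the exact forward signature should mirror llava.py and phi3v.py; the helper name here is an assumption), the model-side work reduces to validating the pixel_values passed through **kwargs and then doing the embed-encode-merge steps from the earlier sketch, with no attention masks, labels, or left-padding handled inside the model, so the encoder call can later move into a separate V1 runner.

from typing import Any, Optional

import torch


def parse_and_validate_image_input(**kwargs: Any) -> Optional[torch.Tensor]:
    """Pull already-preprocessed pixel_values out of the forward kwargs."""
    pixel_values = kwargs.pop("pixel_values", None)
    if pixel_values is None:
        return None  # text-only request: skip the vision encoder entirely
    if not isinstance(pixel_values, torch.Tensor):
        raise ValueError(
            f"Incorrect type of pixel_values: {type(pixel_values)}")
    return pixel_values

If pixel_values is present, forward encodes it with the visual tokenizer and scatters the result into the text embeddings; otherwise it is a plain text-only forward.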
        input_ids = []
        labels = []
        pixel_values = []
        invalidate_label = False
        image_token_indices = [i for i, v in enumerate(raw_input_ids) if v == IMAGE_TOKEN_ID]
        last_image_token_index = -1
        for i in range(len(image_token_indices)):
            head = 0 if i == 0 else image_token_indices[i - 1] + 1
            tail = image_token_indices[i]
            last_image_token_index = tail
            input_ids.extend(raw_input_ids[head:tail])
            labels.extend(raw_labels[head:tail])
            try:
                image = images[i]
                raw_pixel_values, image_placeholders = self.visual_tokenizer.preprocess_image(
                    image, max_partition=max_partition)
            except Exception as e:
                if propagate_exception:
                    raise e
                logging.exception(e)
                invalidate_label = True
                raw_pixel_values, image_placeholders = self.visual_tokenizer.mock_input()
            input_ids.extend(image_placeholders)
            labels.extend([IGNORE_ID] * len(image_placeholders))
            pixel_values.append(raw_pixel_values)
        input_ids.extend(raw_input_ids[last_image_token_index + 1:])
        labels.extend(raw_labels[last_image_token_index + 1:])
Please move image processing out of the model's forward implementation.
    def get_conversation_formatter(self) -> ConversationFormatter:
        if getattr(self, 'conversation_formatter', None) is None:
            self.conversation_formatter = getattr(import_module(".configuration_ovis", __package__),
                                                  self.config.conversation_formatter_class)(self.text_tokenizer)
        return self.conversation_formatter
Why do we need this?
        attention_masks = []
        labels = []
Why do we need to prepare labels for an inference-only runtime?
any news?
This pull request addresses issue #9638 by adding support for the Ovis1.6-Gemma2-9B model.
FIX #8972
FIX #9638