
[Model] Add Support for Ovis1.6-Gemma2-9B Model #11240

Open · wants to merge 6 commits into main

Conversation


@Player256 commented Dec 16, 2024

This pull request addresses issue #9638 by adding support for the Ovis1.6-Gemma2-9B model.

FIX #8972
FIX #9638


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs will not trigger a full CI run by default. Instead, only fastcheck CI will run, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@Isotr0py Isotr0py self-assigned this Dec 18, 2024
@Player256 Player256 marked this pull request as ready for review January 4, 2025 14:02
@Isotr0py (Collaborator) left a comment


This model implementation couples image processing with model forwarding...

You can refer to the model implementations in llava.py and phi3v.py when adding your model implementation.
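For illustration, a rough sketch of the llava-style split (paraphrased from llava.py; helper names like _parse_and_validate_image_input are approximate): all PIL/HF preprocessing happens in the input processor, and forward only consumes tensors.

# Rough sketch of a llava-style forward; helper names are approximate.
# Image preprocessing never happens here, only tensor consumption.
def forward(self, input_ids, positions, kv_caches, attn_metadata,
            intermediate_tensors=None, **kwargs):
    image_input = self._parse_and_validate_image_input(**kwargs)  # tensors only
    if image_input is not None:
        vision_embeds = self._process_image_input(image_input)
        inputs_embeds = self.language_model.get_input_embeddings(input_ids)
        inputs_embeds = merge_multimodal_embeddings(
            input_ids, inputs_embeds, vision_embeds,
            self.config.image_token_index)
        input_ids = None
    else:
        inputs_embeds = None
    return self.language_model(input_ids, positions, kv_caches,
                               attn_metadata, intermediate_tensors,
                               inputs_embeds=inputs_embeds)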

Comment on lines +264 to +266
    def reset_parameters(self, mean=0., std=1.) -> None:
        init.normal_(self.weight, mean=mean, std=std)
        self._fill_padding_idx_with_zero()

Suggested change
    def reset_parameters(self, mean=0., std=1.) -> None:
        init.normal_(self.weight, mean=mean, std=std)
        self._fill_padding_idx_with_zero()

I think this is unnecessary since we always load weights from checkpoint.
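For context, a simplified sketch of typical vLLM weight loading (using default_weight_loader from vllm.model_executor.model_loader.weight_utils) shows why the init is dead code: checkpoint tensors are copied over every parameter before the first forward pass.

# Simplified sketch: any values written by reset_parameters() are
# overwritten here before inference ever starts.
def load_weights(self, weights):
    params = dict(self.named_parameters())
    for name, loaded_weight in weights:
        param = params[name]
        weight_loader = getattr(param, "weight_loader",
                                default_weight_loader)
        weight_loader(param, loaded_weight)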

Comment on lines +268 to +269
@MULTIMODAL_REGISTRY.register_image_input_mapper()
@MULTIMODAL_REGISTRY.register_max_image_tokens(get_max_ovis_image_tokens)

The old input_mapper and input_processor pipeline will be deprecated soon. You will need to implement OvisMultimodalProcessor to handle multimodal data processing instead.

You can refer to LlavaMultimodalProcessor and Phi3VMultimodalProcessor, since this model looks like a llava-style one.
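A bare-bones skeleton might look like the following; the hook names mirror LlavaMultimodalProcessor at the time of writing and should be checked against the current BaseMultiModalProcessor API. Everything here is an assumption about the interface, not the final shape.

# Hypothetical skeleton; hook names modeled on LlavaMultimodalProcessor
# and subject to change with the new multimodal processing API.
class OvisMultiModalProcessor(BaseMultiModalProcessor):

    def _get_mm_fields_config(self, hf_inputs, hf_processor_mm_kwargs):
        # Tell vLLM how HF processor outputs map onto batched fields.
        return dict(pixel_values=MultiModalFieldConfig.batched("image"))

    def _get_prompt_replacements(self, mm_items, hf_processor_mm_kwargs,
                                 out_mm_kwargs):
        # Expand each image token into the per-image placeholder sequence
        # that merge_multimodal() currently builds inside forward().
        ...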

        image_placeholders.append(IMAGE_INDICATOR_IDS[4])
        return image_placeholders

    def preprocess_image(self, image: PIL.Image.Image, max_partition=9, covering_threshold=0.9, convert_to_rgb=True):

I think this should be implemented in the multi-modal processor instead of the vision tokenizer.

    max_image_tokens = min(max_tokens_per_grid, usable_vocab_size)
    return max_image_tokens

class SiglipVisualTokenizer(nn.Module):

Code for multi-modal data processing should be decoupled from the VisualTokenizer.
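Concretely, after the split the tokenizer would only ever see tensors; a sketch of what remains (attribute names are illustrative, and the softmax over a visual vocabulary follows the Ovis paper's probabilistic visual tokens):

# Sketch: the tokenizer keeps only the tensor -> visual-token mapping;
# PIL handling, partitioning, and placeholder insertion move to the
# multimodal processor. Attribute names below are assumptions.
class SiglipVisualTokenizer(nn.Module):

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        features = self.backbone(pixel_values)  # vision encoder features
        logits = self.head(features)            # project to visual vocab
        return torch.softmax(logits, dim=-1)    # probabilistic visual tokens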

Comment on lines +474 to +480
        input_ids, inputs_embeds, _, _ = self.merge_multimodal(
            text_input_ids=inputs,
            text_attention_masks=kwargs.pop('attention_mask'),
            text_labels=None,
            pixel_values=kwargs.pop('pixel_values'),
            left_padding=True
        )

Please keep the forward function aligned with other LMMs, because we will call the encoder in a separate runner for v1 in the future.
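For reference, the direction other vLLM LMMs are moving in exposes the encoder as its own method so a separate runner can call it. A sketch follows; the method names mirror other multimodal models at the time of writing, so verify them against the current model interface.

# Sketch of the v1-friendly split; names should be verified against
# the current vLLM multimodal model interface.
def get_multimodal_embeddings(self, **kwargs):
    image_input = self._parse_and_validate_image_input(**kwargs)
    if image_input is None:
        return None
    return self._process_image_input(image_input)

def get_input_embeddings(self, input_ids, multimodal_embeddings=None):
    inputs_embeds = self.language_model.get_input_embeddings(input_ids)
    if multimodal_embeddings is not None:
        inputs_embeds = merge_multimodal_embeddings(
            input_ids, inputs_embeds, multimodal_embeddings,
            self.config.image_token_index)
    return inputs_embeds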

Comment on lines +427 to +453
        input_ids = []
        labels = []
        pixel_values = []
        invalidate_label = False
        image_token_indices = [i for i, v in enumerate(raw_input_ids) if v == IMAGE_TOKEN_ID]
        last_image_token_index = -1
        for i in range(len(image_token_indices)):
            head = 0 if i == 0 else image_token_indices[i - 1] + 1
            tail = image_token_indices[i]
            last_image_token_index = tail
            input_ids.extend(raw_input_ids[head:tail])
            labels.extend(raw_labels[head:tail])
            try:
                image = images[i]
                raw_pixel_values, image_placeholders = self.visual_tokenizer.preprocess_image(
                    image, max_partition=max_partition)
            except Exception as e:
                if propagate_exception:
                    raise e
                logging.exception(e)
                invalidate_label = True
                raw_pixel_values, image_placeholders = self.visual_tokenizer.mock_input()
            input_ids.extend(image_placeholders)
            labels.extend([IGNORE_ID] * len(image_placeholders))
            pixel_values.append(raw_pixel_values)
        input_ids.extend(raw_input_ids[last_image_token_index + 1:])
        labels.extend(raw_labels[last_image_token_index + 1:])

Please move image processing out of the model's forward implementation.

Comment on lines +300 to +304
    def get_conversation_formatter(self) -> ConversationFormatter:
        if getattr(self, 'conversation_formatter', None) is None:
            self.conversation_formatter = getattr(
                import_module(".configuration_ovis", __package__),
                self.config.conversation_formatter_class)(self.text_tokenizer)
        return self.conversation_formatter

Why do we need this?

Comment on lines +342 to +343
        attention_masks = []
        labels = []

Why do we need to prepare labels for an inference-only runtime?

@Swipe4057

any news?

Successfully merging this pull request may close these issues.

  • [New Model]: Ovis1.6-Gemma2-9B
  • [New Model]: Add support for Ovis models