[Model] Add Support for Ovis1.6-Gemma2-9B Model #11240
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This model implementation is coupling the image processing and the model forwarding... You can refer to the model implementations in llava.py and phi3v.py when adding the model implementation.
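As a minimal sketch of that split, assuming hypothetical names (DecoupledOvisSketch, IMAGE_TOKEN_ID, merge_placeholder_embeddings) rather than the actual vLLM interfaces: the model receives already-preprocessed, batched pixel_values and only embeds and merges them, with no PIL handling inside forward.

import torch
import torch.nn as nn

IMAGE_TOKEN_ID = 32000  # hypothetical placeholder id, for illustration only


def merge_placeholder_embeddings(input_ids: torch.Tensor,
                                 inputs_embeds: torch.Tensor,
                                 image_embeds: torch.Tensor,
                                 image_token_id: int) -> torch.Tensor:
    """Scatter image embeddings into the placeholder-token positions.

    The number of placeholder tokens in input_ids must equal the number of
    rows in image_embeds.
    """
    mask = input_ids == image_token_id
    inputs_embeds = inputs_embeds.clone()
    inputs_embeds[mask] = image_embeds.to(inputs_embeds.dtype)
    return inputs_embeds


class DecoupledOvisSketch(nn.Module):
    """Illustrative only: vision encoding stays separate from preprocessing."""

    def __init__(self, visual_tokenizer: nn.Module, embed_tokens: nn.Embedding,
                 language_model: nn.Module):
        super().__init__()
        self.visual_tokenizer = visual_tokenizer   # tensor in, tensor out
        self.embed_tokens = embed_tokens
        self.language_model = language_model

    def forward(self, input_ids: torch.Tensor,
                pixel_values: torch.Tensor) -> torch.Tensor:
        # pixel_values arrive already preprocessed and batched by the
        # multi-modal processor; the model never sees PIL images.
        inputs_embeds = self.embed_tokens(input_ids)
        image_embeds = self.visual_tokenizer(pixel_values)   # (n, p, d)
        inputs_embeds = merge_placeholder_embeddings(
            input_ids, inputs_embeds, image_embeds.flatten(0, 1), IMAGE_TOKEN_ID)
        return self.language_model(inputs_embeds)

This mirrors how llava.py and phi3v.py keep preprocessing in the input pipeline and leave only embedding and merging to the model.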
    def reset_parameters(self, mean=0., std=1.) -> None:
        init.normal_(self.weight, mean=mean, std=std)
        self._fill_padding_idx_with_zero()
I think this is unnecessary since we always load weights from checkpoint.
@MULTIMODAL_REGISTRY.register_image_input_mapper()
@MULTIMODAL_REGISTRY.register_max_image_tokens(get_max_ovis_image_tokens)
The old input_mapper and input_processor pipeline will be deprecated soon. You will need to implement OvisMultimodalProcessor to handle multimodal data processing instead. You can refer to LlavaMultimodalProcessor and Phi3VMultimodalProcessor, because this model looks like the llava-style ones.
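For orientation only, here is a rough concept sketch of where that processing could live. The class and method names below are hypothetical and do not follow the actual vLLM base-class hooks; copy the real hook names from LlavaMultimodalProcessor and Phi3VMultimodalProcessor. The point is that PIL handling and placeholder expansion happen at request-processing time, so the model only ever sees token ids plus batched pixel_values.

from dataclasses import dataclass
from typing import List

import torch
from PIL import Image


@dataclass
class ProcessedInputs:
    """Everything the model needs: expanded token ids and batched pixels."""
    prompt_token_ids: List[int]
    pixel_values: torch.Tensor


class OvisMultimodalProcessorSketch:
    """Hypothetical stand-in for the real vLLM processor hooks."""

    def __init__(self, tokenizer, image_processor, image_token_id: int,
                 tokens_per_image: int) -> None:
        self.tokenizer = tokenizer                  # HF text tokenizer
        self.image_processor = image_processor      # HF image processor
        self.image_token_id = image_token_id
        self.tokens_per_image = tokens_per_image

    def __call__(self, prompt: str,
                 images: List[Image.Image]) -> ProcessedInputs:
        token_ids = self.tokenizer.encode(prompt)
        # Expand each image token into the full per-image placeholder run so
        # the scheduler can account for image tokens up front.
        expanded: List[int] = []
        for tok in token_ids:
            if tok == self.image_token_id:
                expanded.extend([self.image_token_id] * self.tokens_per_image)
            else:
                expanded.append(tok)
        # All PIL work (resizing, partitioning, normalization) happens here,
        # outside the model.
        pixel_values = self.image_processor(
            images, return_tensors="pt")["pixel_values"]
        return ProcessedInputs(expanded, pixel_values)

With this split, the merge_multimodal logic in the model can shrink to a simple embedding scatter over the placeholder positions.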
        image_placeholders.append(IMAGE_INDICATOR_IDS[4])
        return image_placeholders

    def preprocess_image(self, image: PIL.Image.Image, max_partition=9, covering_threshold=0.9, convert_to_rgb=True):
I think this should be implemented in the multi-modal processor instead of the vision tokenizer.
    max_image_tokens = min(max_tokens_per_grid, usable_vocab_size)
    return max_image_tokens


class SiglipVisualTokenizer(nn.Module):
Code for multi-modal data processing should be decoupled from the VisualTokenizer.
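As a hedged sketch only (the module names and the softmax head below follow the shape of the PR's code, not an exact API), a decoupled visual tokenizer reduces to a tensor-in/tensor-out module:

import torch
import torch.nn as nn


class SiglipVisualTokenizerSketch(nn.Module):
    """Vision encoding only: no PIL, no resizing, no placeholder bookkeeping."""

    def __init__(self, backbone: nn.Module, hidden_size: int,
                 visual_vocab_size: int) -> None:
        super().__init__()
        self.backbone = backbone                  # e.g. a SigLIP vision tower
        self.head = nn.Linear(hidden_size, visual_vocab_size)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # pixel_values: (num_images, channels, height, width), already
        # batched and normalized by the multi-modal processor.
        features = self.backbone(pixel_values)    # (num_images, patches, hidden)
        logits = self.head(features)
        return torch.softmax(logits, dim=-1)      # soft visual-token probabilities

preprocess_image, mock_input, and the placeholder bookkeeping would move into the multi-modal processor sketched above.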
        input_ids, inputs_embeds, _, _ = self.merge_multimodal(
            text_input_ids=inputs,
            text_attention_masks=kwargs.pop('attention_mask'),
            text_labels=None,
            pixel_values=kwargs.pop('pixel_values'),
            left_padding=True
        )
Please keep the forward function aligned with the other LMMs, because we will call the encoder in a separate runner for V1 in the future.
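As a hedged outline (the exact forward signature should mirror llava.py and phi3v.py; the helper name here is an assumption), the model-side work reduces to validating the pixel_values passed through **kwargs and then doing the embed-encode-merge steps from the earlier sketch, with no attention masks, labels, or left-padding handled inside the model, so the encoder call can later move into a separate V1 runner.

from typing import Any, Optional

import torch


def parse_and_validate_image_input(**kwargs: Any) -> Optional[torch.Tensor]:
    """Pull already-preprocessed pixel_values out of the forward kwargs."""
    pixel_values = kwargs.pop("pixel_values", None)
    if pixel_values is None:
        return None  # text-only request: skip the vision encoder entirely
    if not isinstance(pixel_values, torch.Tensor):
        raise ValueError(
            f"Incorrect type of pixel_values: {type(pixel_values)}")
    return pixel_values

If pixel_values is present, forward encodes it with the visual tokenizer and scatters the result into the text embeddings; otherwise it is a plain text-only forward.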
        input_ids = []
        labels = []
        pixel_values = []
        invalidate_label = False
        image_token_indices = [i for i, v in enumerate(raw_input_ids) if v == IMAGE_TOKEN_ID]
        last_image_token_index = -1
        for i in range(len(image_token_indices)):
            head = 0 if i == 0 else image_token_indices[i - 1] + 1
            tail = image_token_indices[i]
            last_image_token_index = tail
            input_ids.extend(raw_input_ids[head:tail])
            labels.extend(raw_labels[head:tail])
            try:
                image = images[i]
                raw_pixel_values, image_placeholders = self.visual_tokenizer.preprocess_image(
                    image, max_partition=max_partition)
            except Exception as e:
                if propagate_exception:
                    raise e
                logging.exception(e)
                invalidate_label = True
                raw_pixel_values, image_placeholders = self.visual_tokenizer.mock_input()
            input_ids.extend(image_placeholders)
            labels.extend([IGNORE_ID] * len(image_placeholders))
            pixel_values.append(raw_pixel_values)
        input_ids.extend(raw_input_ids[last_image_token_index + 1:])
        labels.extend(raw_labels[last_image_token_index + 1:])
Please move image processing out of the model's forward implementation.
    def get_conversation_formatter(self) -> ConversationFormatter:
        if getattr(self, 'conversation_formatter', None) is None:
            self.conversation_formatter = getattr(import_module(".configuration_ovis", __package__),
                                                  self.config.conversation_formatter_class)(self.text_tokenizer)
        return self.conversation_formatter
Why do we need this?
        attention_masks = []
        labels = []
Why do we need to prepare labels for an inference-only runtime?
any news?
This pull request addresses issue #9638 by adding support for the Ovis1.6-Gemma2-9B model.
FIX #8972
FIX #9638