
[Model] Add Support for Ovis1.6-Gemma2-9B Model #11240

Draft
Player256 wants to merge 19 commits into main

Conversation


@Player256 Player256 commented Dec 16, 2024

This pull request addresses issue #9638 by adding support for the Ovis1.6-Gemma2-9B model.

FIX #8972
FIX #9638


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@Isotr0py Isotr0py self-assigned this Dec 18, 2024
@Player256 Player256 marked this pull request as ready for review January 4, 2025 14:02
Collaborator

@Isotr0py Isotr0py left a comment


This model implementation couples the image processing with the model forward pass...

You can refer to the model implementations in llava.py and phi3v.py when adding this one.
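For reference, the decoupled structure that llava.py and phi3v.py follow looks roughly like the sketch below: a processing-info class and a multimodal processor own all image preprocessing and are registered alongside the model, while the model's forward only consumes the processed tensors. The class names and import paths here are illustrative assumptions and should be checked against the current vLLM tree, not taken as the final implementation.

from torch import nn

from vllm.model_executor.models.interfaces import SupportsMultiModal
from vllm.multimodal import MULTIMODAL_REGISTRY
from vllm.multimodal.processing import BaseMultiModalProcessor, BaseProcessingInfo
from vllm.multimodal.profiling import BaseDummyInputsBuilder


class OvisProcessingInfo(BaseProcessingInfo):
    # Knows how many images are supported and how many tokens each produces.
    ...


class OvisDummyInputsBuilder(BaseDummyInputsBuilder[OvisProcessingInfo]):
    # Builds dummy multimodal inputs for memory profiling.
    ...


class OvisMultiModalProcessor(BaseMultiModalProcessor[OvisProcessingInfo]):
    # All image preprocessing and prompt replacement happens here,
    # never inside the model's forward pass.
    ...


@MULTIMODAL_REGISTRY.register_processor(OvisMultiModalProcessor,
                                        info=OvisProcessingInfo,
                                        dummy_inputs=OvisDummyInputsBuilder)
class OvisForConditionalGeneration(nn.Module, SupportsMultiModal):
    # forward() only consumes input_ids / inputs_embeds and the already
    # processed pixel_values.
    ...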

@Swipe4057

any news?

@Player256
Author

Hey @Isotr0py could you give this PR a review?

Collaborator

@Isotr0py Isotr0py left a comment


Although the model implementation has improved, there is still a lot that needs to be done:

  • Update the documentation to mention this supported model in docs/source/models/supported_models.md
  • Add an example in examples/offline_inference/vision_language.py; if this model supports multi-image inputs, please also update examples/offline_inference/vision_language_multi_image.py (see the example sketch after this list)
  • Add model correctness tests in tests/models/decoder_only/vision_language/test_models.py and a processor correctness test in tests/models/multimodal/processing/test_common.py
  • Update tests/models/registry.py with the model information.
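On the example entry: a minimal sketch of what it could look like, assuming examples/offline_inference/vision_language.py keeps its per-model helper pattern; the <image> prompt placeholder and the exact prompt format are assumptions to be replaced with the real Ovis chat template.

from vllm import LLM


def run_ovis(question: str):
    # Prompt format is an assumption, not the verified Ovis template.
    prompt = f"<image>\n{question}"
    llm = LLM(model="AIDC-AI/Ovis1.6-Gemma2-9B",
              trust_remote_code=True,
              max_model_len=4096)
    return llm, prompt

The example script would then pass the returned prompt together with multi_modal_data={"image": image} to llm.generate, as the other per-model entries do.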

Comment on lines 456 to 463
# def merge_multimodal(
#     self,
#     text_input_ids: torch.Tensor,
#     text_attention_masks: torch.Tensor,
#     text_labels: Optional[torch.Tensor],
#     pixel_values: List[Optional[torch.Tensor]],
#     left_padding: bool = False
# ):
Collaborator


Please remove this unused code.

Comment on lines 434 to 437
@cached_property
def sampler(self):
    if hasattr(self.llm,"sampler"):
        return self.llm.sampler
Collaborator


Missing fallback?
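For context, other wrapped language models typically fall back to a default sampler rather than implicitly returning None; a hedged sketch, assuming get_sampler from vllm.model_executor.layers.sampler is the right default here (this method would live inside the existing model class):

from functools import cached_property

from vllm.model_executor.layers.sampler import get_sampler


@cached_property
def sampler(self):
    # Prefer the wrapped LLM's sampler, otherwise fall back to the default
    # sampler instead of silently returning None.
    if hasattr(self.llm, "sampler"):
        return self.llm.sampler
    return get_sampler()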

@Isotr0py
Collaborator

Isotr0py commented Feb 4, 2025

Please address pre-commit linting errors as well.

@Player256
Author

Please address pre-commit linting errors as well.

Thanks @Isotr0py for the review, I'll get back to it.

@ismael-dm
Contributor

Will this PR also cover the new Ovis 2 models? https://huggingface.co/collections/AIDC-AI/ovis2-67ab36c7e497429034874464

@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 27, 2025
@Player256 Player256 marked this pull request as draft February 27, 2025 09:06
@Player256
Author

I'll add the tests for it.

@@ -270,6 +270,7 @@ def check_available_online(
trust_remote_code=True),
"NVLM_D": _HfExamplesInfo("nvidia/NVLM-D-72B",
trust_remote_code=True),
"OvisForConditionalGeneration": _HfExamplesInfo("AIDC-AI/Ovis1.6-Gemma2-9B",trust_remote_code=True), # noqa: E501
Collaborator


Suggested change
"OvisForConditionalGeneration": _HfExamplesInfo("AIDC-AI/Ovis1.6-Gemma2-9B",trust_remote_code=True), # noqa: E501
"OvisForConditionalGeneration": _HfExamplesInfo("AIDC-AI/Ovis1.6-Llama3.2-3B",trust_remote_code=True), # noqa: E501

We can use a smaller model for the registry and tests: AIDC-AI/Ovis1.6-Llama3.2-3B

Author


Ok

Signed-off-by: Isotr0py <[email protected]>
Comment on lines +342 to +347
def _get_prompt_replacements(
    self,
    mm_items: MultiModalDataItems,
    hf_processor_mm_kwargs: Mapping[str, Any],
    out_mm_kwargs: MultiModalKwargs,
) -> list[PromptReplacement]:
Member


We have to update this based on #13964
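For reference, a hedged sketch of the direction, written against the pre-#13964 PromptReplacement API used by llava.py; after #13964 the method and types may be renamed, so this would need adapting. get_image_placeholder_ids() is a hypothetical helper; the point is to reuse placeholder ids the processor already computed instead of re-running image preprocessing in the replacement callback (see the comment further down).

def _get_prompt_replacements(
    self,
    mm_items: MultiModalDataItems,
    hf_processor_mm_kwargs: Mapping[str, Any],
    out_mm_kwargs: MultiModalKwargs,
) -> list[PromptReplacement]:

    def get_replacement(item_idx: int):
        # Reuse the placeholder ids already produced for this image rather
        # than calling self.preprocess_image() again (hypothetical helper).
        return self.info.get_image_placeholder_ids(out_mm_kwargs, item_idx)

    return [
        PromptReplacement(
            modality="image",
            target=[IMAGE_TOKEN_ID],
            replacement=get_replacement,
        ),
    ]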

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Collaborator

@Isotr0py Isotr0py left a comment


@Player256 I tried this PR, but it doesn't work. I managed to get the model loaded, but it seems that the multimodal processor implementation still doesn't work.

Comment on lines 349 to 352
def get_replacement_ovis(image: PIL.Image.Image):
    _, image_placeholders = self.preprocess_image(image)

    return image_placeholders
Collaborator


Why do we re-process images here?

Comment on lines +292 to +293
def get_image_size_with_most_features(self) -> ImageSize:
    return ImageSize(height=384,width=384)
Collaborator


It seems that Ovis uses dynamic resizing (https://huggingface.co/AIDC-AI/Ovis1.6-Llama3.2-3B/blob/b8d93d7468f47fd803eb26ec2c1bc2d7e5fba60e/modeling_ovis.py#L135-L159), so does a 384x384 image size really return the most image features from the visual tokenizer?
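A hedged sketch of the alternative: probe a few candidate extremes under the dynamic-resize scheme and return whichever yields the most visual tokens, as phi3v-style processors do. get_num_image_tokens() is a hypothetical helper that would wrap Ovis's partitioning logic, and the candidate sizes are assumptions.

def get_image_size_with_most_features(self) -> ImageSize:
    # Candidate aspect-ratio extremes; adjust to match modeling_ovis.py's
    # max_partition behaviour.
    candidates = [(384, 384), (768, 768), (384, 3456), (3456, 384)]
    best_size, best_tokens = ImageSize(height=384, width=384), -1
    for height, width in candidates:
        num_tokens = self.get_num_image_tokens(image_height=height,
                                               image_width=width)
        if num_tokens > best_tokens:
            best_size = ImageSize(height=height, width=width)
            best_tokens = num_tokens
    return best_size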

Comment on lines +540 to +543
if multimodal_embeddings is not None:
    input_embeds = merge_multimodal_embeddings(
        input_ids, input_embeds, multimodal_embeddings,
        IMAGE_TOKEN_ID)
Collaborator


It seems that image_token_id gets replaced by different placeholder tokens here? See the original Ovis preprocessing:

# place image placeholders
input_ids = []
pixel_values = []
image_token_indices = [i for i, v in enumerate(raw_input_ids) if v == IMAGE_TOKEN_ID]
last_image_token_index = -1
for i in range(len(image_token_indices)):
    head = 0 if i == 0 else image_token_indices[i - 1] + 1
    tail = image_token_indices[i]
    last_image_token_index = tail
    input_ids.extend(raw_input_ids[head:tail])
    try:
        image = images[i]
        raw_pixel_values, image_placeholders = self.preprocess_image(
            image, max_partition=max_partition)
    except Exception as e:
        if propagate_exception:
            raise e
        logging.exception(e)
        raw_pixel_values, image_placeholders = self.visual_tokenizer.mock_input()
    input_ids.extend(image_placeholders)
    pixel_values.append(raw_pixel_values)
input_ids.extend(raw_input_ids[last_image_token_index + 1:])
# return tensors
input_ids = torch.tensor(input_ids, dtype=torch.long)
pixel_values = torch.cat(pixel_values, dim=0) if len(pixel_values) > 0 else None
return prompt, input_ids, pixel_values
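As the snippet shows, every occurrence of IMAGE_TOKEN_ID is expanded into a run of image_placeholders, so the merge step likely has to match all of those placeholder ids, not just IMAGE_TOKEN_ID. merge_multimodal_embeddings may accept a list of ids; if not, something along the lines of the standalone sketch below could work (the placeholder ids passed in would have to come from the Ovis visual tokenizer, not from this example):

import torch


def merge_at_placeholders(input_ids: torch.Tensor,
                          inputs_embeds: torch.Tensor,
                          image_embeds: torch.Tensor,
                          placeholder_ids: list[int]) -> torch.Tensor:
    # Scatter the flattened image embeddings into every position whose token
    # id is one of the Ovis image placeholder ids.
    mask = torch.isin(input_ids,
                      torch.tensor(placeholder_ids, device=input_ids.device))
    inputs_embeds[mask] = image_embeds.to(inputs_embeds.dtype)
    return inputs_embeds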

@Player256
Author

@Player256 I tried this PR, but it doesn't work. I managed to get the model loaded, but it seems that the multimodal processor implementation still doesn't work.

Hey let me get back to you with a working implementation.
