[Model] Add Support for Ovis1.6-Gemma2-9B Model #11240
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
This model implementation couples the image processing and the model forwarding...
You can refer to the model implementations in llava.py and phi3v.py when adding a new model implementation.
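For illustration, here is a minimal, self-contained sketch of that separation; the class and helper names are illustrative, not vLLM or Ovis APIs. Preprocessing turns raw images into tensors once, up front, and the model's forward only ever consumes those tensors:

```python
# Illustrative only: these names are not vLLM APIs. The point is the
# separation of concerns the reviewer asks for.
from dataclasses import dataclass

import numpy as np
import torch
import torch.nn as nn


@dataclass
class ProcessedImageInputs:
    pixel_values: torch.Tensor  # (num_images, 3, H, W)


class ImageProcessor:
    """Runs once per request, outside the model's forward pass."""

    def __call__(self, images) -> ProcessedImageInputs:
        tensors = [
            torch.from_numpy(np.asarray(img)).permute(2, 0, 1).float()
            for img in images
        ]
        return ProcessedImageInputs(pixel_values=torch.stack(tensors))


class VisionLanguageModel(nn.Module):
    """forward() only sees tensors produced by the processor."""

    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.proj = nn.LazyLinear(hidden_size)  # stand-in visual tokenizer

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # No PIL decoding or resizing happens here.
        return self.proj(pixel_values.flatten(start_dim=1))
```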
any news?
Hey @Isotr0py, could you give this PR a review?
Although the model implementation has gotten better, there are still several things that need to be done:
- Update the documentation to mention this supported model in `docs/source/models/supported_models.md`
- Add an example in `examples/offline_inference/vision_language.py` (a sketch of such an entry follows this list); if this model supports multi-image inputs, please also update `examples/offline_inference/vision_language_multi_image.py`
- Add model correctness tests in `tests/models/decoder_only/vision_language/test_models.py` and a processor correctness test in `tests/models/multimodal/processing/test_common.py`
- Update `tests/models/registry.py` with the model information.
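As a rough illustration of the example-file item, an entry might look like the sketch below, following the `run_*` pattern of the neighbouring functions in that file (the signature and return convention are assumed from those neighbours, and the Gemma2-style prompt template is an assumption, not the verified Ovis chat format):

```python
# Sketch only: the prompt template below is an assumption; check the
# Ovis1.6 chat template before relying on it.
def run_ovis(question: str, modality: str):
    assert modality == "image"

    prompt = ("<start_of_turn>user\n"
              f"<image>\n{question}<end_of_turn>\n"
              "<start_of_turn>model\n")
    llm = LLM(model="AIDC-AI/Ovis1.6-Gemma2-9B",
              max_model_len=4096,
              trust_remote_code=True)
    stop_token_ids = None
    return llm, prompt, stop_token_ids
```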
vllm/model_executor/models/ovis.py
Outdated
```python
# def merge_multimodal(
#     self,
#     text_input_ids: torch.Tensor,
#     text_attention_masks: torch.Tensor,
#     text_labels: Optional[torch.Tensor],
#     pixel_values: List[Optional[torch.Tensor]],
#     left_padding: bool = False
# ):
```
Please remove this unused code.
vllm/model_executor/models/ovis.py
Outdated
```python
@cached_property
def sampler(self):
    if hasattr(self.llm, "sampler"):
        return self.llm.sampler
```
Missing fallback?
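For instance, a fallback along these lines might be what's wanted, assuming vLLM's default `get_sampler()` helper is the appropriate default for this model:

```python
from functools import cached_property

from vllm.model_executor.layers.sampler import get_sampler


@cached_property
def sampler(self):
    if hasattr(self.llm, "sampler"):
        return self.llm.sampler
    # Fallback: use vLLM's default sampler when the wrapped
    # language model does not expose one of its own.
    return get_sampler()
```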
Please address pre-commit linting errors as well.
Thanks @Isotr0py for the review, I'll get back to it.
Will this PR also cover the new Ovis 2 models? https://huggingface.co/collections/AIDC-AI/ovis2-67ab36c7e497429034874464
Signed-off-by: Player256 <[email protected]>
I'll add the tests for it.
tests/models/registry.py
Outdated
```diff
@@ -270,6 +270,7 @@ def check_available_online(
                                      trust_remote_code=True),
     "NVLM_D": _HfExamplesInfo("nvidia/NVLM-D-72B",
                               trust_remote_code=True),
+    "OvisForConditionalGeneration": _HfExamplesInfo("AIDC-AI/Ovis1.6-Gemma2-9B",trust_remote_code=True), # noqa: E501
```
"OvisForConditionalGeneration": _HfExamplesInfo("AIDC-AI/Ovis1.6-Gemma2-9B",trust_remote_code=True), # noqa: E501 | |
"OvisForConditionalGeneration": _HfExamplesInfo("AIDC-AI/Ovis1.6-Llama3.2-3B",trust_remote_code=True), # noqa: E501 |
We can use a smaller model for registry and test: AIDC-AI/Ovis1.6-Llama3.2-3B
Ok
Signed-off-by: Isotr0py <[email protected]>
```python
def _get_prompt_replacements(
    self,
    mm_items: MultiModalDataItems,
    hf_processor_mm_kwargs: Mapping[str, Any],
    out_mm_kwargs: MultiModalKwargs,
) -> list[PromptReplacement]:
```
We have to update this based on #13964
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
@Player256 I tried this PR, but it doesn't work. I managed to get the model to load, but it seems that the multimodal processor implementation still doesn't work.
vllm/model_executor/models/ovis.py
Outdated
```python
def get_replacement_ovis(image: PIL.Image.Image):
    _, image_placeholders = self.preprocess_image(image)

    return image_placeholders
```
Why do we re-process images here?
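One way to avoid the second pass, sketched under the assumption that the per-image placeholder ids are already cached in `out_mm_kwargs` during the main processing pass (the `"image_placeholders"` key is hypothetical):

```python
def _get_prompt_replacements(
    self,
    mm_items: MultiModalDataItems,
    hf_processor_mm_kwargs: Mapping[str, Any],
    out_mm_kwargs: MultiModalKwargs,
) -> list[PromptReplacement]:
    # Hypothetical key: reuse placeholders computed when the images
    # were first processed instead of calling preprocess_image() again.
    image_placeholders = out_mm_kwargs["image_placeholders"]

    def get_replacement_ovis(item_idx: int):
        return image_placeholders[item_idx]

    return [
        PromptReplacement(
            modality="image",
            target=[IMAGE_TOKEN_ID],
            replacement=get_replacement_ovis,
        )
    ]
```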
```python
def get_image_size_with_most_features(self) -> ImageSize:
    return ImageSize(height=384, width=384)
```
Seems that Ovis will use dynamic resize (https://huggingface.co/AIDC-AI/Ovis1.6-Llama3.2-3B/blob/b8d93d7468f47fd803eb26ec2c1bc2d7e5fba60e/modeling_ovis.py#L135-L159), so does a 384x384 image size really return the most image features from the visual tokenizer?
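If the goal is to maximize the dynamic-partition tile count at profiling time, something closer to the following might be needed; the 384px tile side, the max_partition budget of 9, and the premise that an elongated image maximizes the tile count are all assumptions read off the linked HF code, not verified values:

```python
def get_image_size_with_most_features(self) -> ImageSize:
    # Assumed constants: the visual tokenizer's tile side and the
    # default max_partition budget from the HF modeling code.
    tile_size, max_partition = 384, 9
    # Under dynamic partitioning, a maximally elongated image should
    # yield the largest number of tiles, hence the most features.
    return ImageSize(height=tile_size, width=tile_size * max_partition)
```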
```python
if multimodal_embeddings is not None:
    input_embeds = merge_multimodal_embeddings(
        input_ids, input_embeds, multimodal_embeddings,
        IMAGE_TOKEN_ID)
```
Seems that image_token_id has been replaced by different placeholder tokens here?
vllm/vllm/model_executor/models/ovis.py
Lines 199 to 226 in f2eca81
```python
# place image placeholders
input_ids = []
pixel_values = []
image_token_indices = [i for i, v in enumerate(raw_input_ids) if v == IMAGE_TOKEN_ID]
last_image_token_index = -1
for i in range(len(image_token_indices)):
    head = 0 if i == 0 else image_token_indices[i - 1] + 1
    tail = image_token_indices[i]
    last_image_token_index = tail
    input_ids.extend(raw_input_ids[head:tail])
    try:
        image = images[i]
        raw_pixel_values, image_placeholders = self.preprocess_image(
            image, max_partition=max_partition)
    except Exception as e:
        if propagate_exception:
            raise e
        logging.exception(e)
        raw_pixel_values, image_placeholders = self.visual_tokenizer.mock_input()
    input_ids.extend(image_placeholders)
    pixel_values.append(raw_pixel_values)
input_ids.extend(raw_input_ids[last_image_token_index + 1:])
# return tensors
input_ids = torch.tensor(input_ids, dtype=torch.long)
pixel_values = torch.cat(pixel_values, dim=0) if len(pixel_values) > 0 else None
return prompt, input_ids, pixel_values
```
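If the processor indeed expands each `IMAGE_TOKEN_ID` slot into several distinct placeholder ids (as the quoted code suggests via `image_placeholders`), one option is to pass all of those ids when merging; `merge_multimodal_embeddings` also accepts a list of token ids. The id names below are hypothetical:

```python
# Hypothetical names: the real placeholder ids come from the Ovis
# tokenizer/config, not from these constants.
IMAGE_PLACEHOLDER_TOKEN_IDS = [
    IMAGE_TOKEN_ID, IMAGE_ATOM_TOKEN_ID, IMAGE_PAD_TOKEN_ID
]

if multimodal_embeddings is not None:
    input_embeds = merge_multimodal_embeddings(
        input_ids, input_embeds, multimodal_embeddings,
        IMAGE_PLACEHOLDER_TOKEN_IDS)
```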
Hey, let me get back to you with a working implementation.
This pull request addresses issue #9638 by adding support for the Ovis1.6-Gemma2-9B model.
FIX #8972
FIX #9638