[Model Request] Molmo-72B-0924 #408

Open
dprokhorov17 opened this issue Jan 11, 2025 · 5 comments

@dprokhorov17

Hello,

Would you guys please take a look at this great model https://huggingface.co/allenai/Molmo-7B-D-0924 and quantize it?

Thanks in advance.

@wenhuach21
Contributor

We will take a look and keep you updated. 7B or 72B?

@dprokhorov17
Author

72B is more interesting.

@wenhuach21
Contributor

The 7B model has been uploaded and is available on Hugging Face: https://huggingface.co/OPEA/Molmo-7B-D-0924-int4-sym-inc. Please allow additional time for the quantization of the 72B model.
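
For reference, loading the INT4 checkpoint with transformers should look roughly like this (a minimal sketch; Molmo needs trust_remote_code, and arguments like torch_dtype/device_map may need adjusting for your hardware):

```python
# Minimal sketch: load the INT4 Molmo checkpoint with transformers.
# The repo id is the OPEA upload referenced in this thread;
# torch_dtype/device_map may need adjusting for your setup.
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "OPEA/Molmo-7B-D-0924-int4-sym-inc"
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,  # Molmo ships custom modeling code
    torch_dtype="auto",
    device_map="auto",
)
```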

@dprokhorov17
Author

dprokhorov17 commented Jan 14, 2025

Hello @wenhuach21,

I have tested your quantized model against the default 7B model with the following results:

[Screenshots: generation-speed comparison of the quantized model vs. the original 7B model]

The performance drop is massive! That's roughly 2x slower than the original...

PS: I quickly hacked the quantized model you provided (https://huggingface.co/OPEA/Molmo-7B-D-0924-int4-sym-inc) into the openedai-vision repo.

@wenhuach21
Contributor

wenhuach21 commented Jan 14, 2025

While INT4 models typically generate faster thanks to reduced memory usage, the prefill stage (prompt processing) can be slower than with 16-bit models, as it is more compute-bound. Consequently, the performance difference between INT4 and 16-bit models largely depends on the prompt length and the number of generated tokens. For VLMs, images/videos introduce extra prefill tokens, which makes the prefill stage weigh even more heavily.
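
To see this split concretely, one can time the two phases separately, treating a one-token generation as the prefill cost (a rough sketch reusing the model/processor from the loading example above; the prompt length is a placeholder):

```python
# Rough sketch: separate prefill cost from per-token decode cost.
# A 1-token generation approximates prefill; the remainder is decode.
import time
import torch

prompt_ids = processor.tokenizer("some long prompt " * 200,
                                 return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    t0 = time.perf_counter()
    model.generate(prompt_ids, max_new_tokens=1)    # ~prefill only
    t1 = time.perf_counter()
    model.generate(prompt_ids, max_new_tokens=129)  # prefill + 128 decode steps
    t2 = time.perf_counter()

prefill_s = t1 - t0
decode_ms = ((t2 - t1) - prefill_s) / 128 * 1000    # rough ms per decoded token
print(f"prefill ~= {prefill_s:.2f} s, decode ~= {decode_ms:.1f} ms/token")
```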

Another option is to run the computation in the INT8 data type, which I believe is supported by Intel's extension for PyTorch on CPUs. It might be worth trying this approach.
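
A rough sketch of what the IPEX static INT8 flow looks like, going by the intel_extension_for_pytorch quantization docs (API names vary between versions, and VLM support would need verifying, so treat this as illustrative rather than a tested recipe):

```python
# Illustrative sketch of IPEX static INT8 quantization on CPU.
# API follows the intel_extension_for_pytorch docs; exact names can
# differ across versions.
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert

model.eval()
qconfig = ipex.quantization.default_static_qconfig_mapping
example_inputs = (prompt_ids,)  # a representative input, as above

prepared = prepare(model, qconfig, example_inputs=example_inputs, inplace=False)

# Calibrate on a few representative prompts (calib_batches is hypothetical).
with torch.no_grad():
    for batch in calib_batches:
        prepared(batch)

int8_model = convert(prepared)
```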
