[Model Request] Molmo-72B-0924 #408
Comments
We will take a look and keep you updated. 7B or 72B?
72B is more interesting.
The 7B has been uploaded and is available on Hugging Face. Please allow additional time for the quantization of the 72B model.
Hello @wenhuach21, I have tested your quantized model against the default 7B model. The performance drop is massive: it's 2x slower than the original... PS: I quickly hacked the openedai-vision repo with the implementation you provided (https://huggingface.co/OPEA/Molmo-7B-D-0924-int4-sym-inc).
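For reference, the kind of end-to-end A/B timing behind a report like this can be done with the standard Molmo remote-code path from the allenai model card. Below is a minimal sketch, assuming the OPEA INT4 repo loads the same way once its quantization backend (e.g. auto-round) is installed; the image URL and prompt are just placeholders:

```python
# Hypothetical A/B latency check: original Molmo-7B-D vs. the OPEA INT4 repo.
# Assumes both checkpoints load through Molmo's trust_remote_code path and that
# the backend needed for the INT4 repo (e.g. auto-round) is installed.
import time
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

IMAGE_URL = "https://picsum.photos/id/237/536/354"  # placeholder test image
PROMPT = "Describe this image."

def run_once(repo_id: str) -> float:
    processor = AutoProcessor.from_pretrained(
        repo_id, trust_remote_code=True, torch_dtype="auto", device_map="auto")
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, trust_remote_code=True, torch_dtype="auto", device_map="auto")

    image = Image.open(requests.get(IMAGE_URL, stream=True).raw)
    inputs = processor.process(images=[image], text=PROMPT)
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

    start = time.perf_counter()
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer,
    )
    elapsed = time.perf_counter() - start

    new_tokens = output[0, inputs["input_ids"].size(1):]
    text = processor.tokenizer.decode(new_tokens, skip_special_tokens=True)
    print(f"{repo_id}: {elapsed:.2f}s  {text[:80]!r}")
    return elapsed

# Loading both models back to back needs enough memory; run one at a time if not.
for repo in ("allenai/Molmo-7B-D-0924", "OPEA/Molmo-7B-D-0924-int4-sym-inc"):
    run_once(repo)
```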
While INT4 models typically generate faster thanks to reduced memory usage, the prefill stage (prompt processing) may be slower than with 16-bit models, as it is more compute-bound. Consequently, the performance difference between INT4 and 16-bit models largely depends on the prompt length and the number of generated tokens. For VLMs, images/videos introduce extra prefill tokens. Another option is to run the computation in the INT8 data type, which I believe is supported by Intel Extension for PyTorch on CPUs. It might be worth trying this approach.
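To see which phase is responsible, prefill and decode can be timed separately. One rough way, sketched below, is to treat a 1-token generation as "prefill only" and derive the decode rate from the extra time a longer run takes (model, processor and inputs prepared as in the sketch above; this splitting trick is just an approximation, not an API the library provides):

```python
# Rough prefill-vs-decode split for an already loaded Molmo-style (model,
# processor) pair and preprocessed `inputs`. A 1-token generation approximates
# the prefill cost; decode throughput is estimated from the extra time a longer
# generation takes.
import time
from transformers import GenerationConfig

def profile_phases(model, processor, inputs, long_tokens: int = 200):
    def timed_generate(max_new_tokens: int):
        start = time.perf_counter()
        out = model.generate_from_batch(
            inputs,
            GenerationConfig(max_new_tokens=max_new_tokens,
                             stop_strings="<|endoftext|>"),
            tokenizer=processor.tokenizer,
        )
        elapsed = time.perf_counter() - start
        n_new = out.shape[1] - inputs["input_ids"].shape[1]  # may stop early
        return elapsed, n_new

    prefill_s, _ = timed_generate(1)               # ~prompt processing only
    total_s, n_new = timed_generate(long_tokens)   # prefill + decode
    decode_tps = max(n_new - 1, 1) / max(total_s - prefill_s, 1e-9)
    print(f"prefill ~{prefill_s:.2f}s, decode ~{decode_tps:.1f} tok/s")
```

On long prompts, and especially on image-heavy VLM inputs, the prefill term dominates, which is where an INT4 model can end up slower than the 16-bit one.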
Hello,
Would you guys please take a look at this great model https://huggingface.co/allenai/Molmo-7B-D-0924 and quantize it?
Thanks in advance.
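For context, a request like this would presumably be handled with AutoRound, the tool behind the OPEA INT4 repos. A minimal sketch of that flow is below; the text-only calibration and the exact settings are assumptions (a VLM like Molmo may need extra handling for its vision tower), so treat it as an illustration rather than the recipe actually used:

```python
# Hypothetical AutoRound recipe for an INT4 symmetric quantization of Molmo-7B-D.
# Settings (bits=4, group_size=128, sym=True) are guessed from the
# "-int4-sym-" naming; text-only calibration of a VLM is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "allenai/Molmo-7B-D-0924"
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./Molmo-7B-D-0924-int4-sym", format="auto_round")
```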