
gguf quantize and speed up support #9926

Open
chuck-ma opened this issue Nov 14, 2024 · 2 comments

@chuck-ma

chuck-ma commented Nov 14, 2024

Is your feature request related to a problem? Please describe.
GGUF is becoming a mainstream format for compressing large models and accelerating their inference. Transformers can already load T5 checkpoints in GGUF format, but inference with them is not accelerated.
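
For context, the current loading path looks roughly like the sketch below, assuming a transformers version with GGUF support for T5 and the `gguf` package installed; the repository id and filename are illustrative placeholders, not real artifacts.

```python
# Hedged sketch of today's behavior: GGUF weights can be loaded via `gguf_file`,
# but they are dequantized to a full-precision torch dtype during loading.
from transformers import T5EncoderModel

model = T5EncoderModel.from_pretrained(
    "some-org/t5-xxl-encoder-gguf",        # placeholder GGUF repository
    gguf_file="t5-xxl-encoder-Q8_0.gguf",  # placeholder quantized checkpoint
)
# Because the tensors are expanded on load, the memory and speed benefits of the
# quantized format are lost at inference time.
```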

Describe the solution you'd like.
It would be very helpful if models available in GGUF format (such as T5 and the Flux transformer component) could be loaded from GGUF files and then run inference directly in that quantized format, instead of being converted to float32 for inference.
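
As a rough illustration of the requested behavior (not an existing transformers or diffusers API), a layer could keep its quantized weights resident and dequantize them only for the duration of each forward pass, so peak memory stays close to the quantized footprint. The class name and the simple int8-plus-scale scheme below are illustrative, not GGUF's actual block formats.

```python
# Toy sketch of on-the-fly dequantization at forward time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OnTheFlyDequantLinear(nn.Module):
    def __init__(self, q_weight, scales, bias=None):
        super().__init__()
        self.register_buffer("q_weight", q_weight)  # int8, shape (out_features, in_features)
        self.register_buffer("scales", scales)      # float, shape (out_features, 1)
        self.register_buffer("bias", bias)          # optional float bias

    def forward(self, x):
        # Dequantize only for this matmul; the full-precision copy is temporary,
        # so the model's resident weights stay in the compact format.
        w = self.q_weight.to(x.dtype) * self.scales.to(x.dtype)
        return F.linear(x, w, self.bias)
```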

Describe alternatives you've considered.

Additional context.

@sayakpaul
Member

#9487 (comment)

Cc: @DN6

@DN6
Collaborator

DN6 commented Nov 14, 2024

Hi @chuck-ma. A PR for what you're describing is in the works. I will open it soon.
