
gguf quantize and speed up support #9926

Open
chuck-ma opened this issue Nov 14, 2024 · 2 comments

@chuck-ma

chuck-ma commented Nov 14, 2024

Is your feature request related to a problem? Please describe.
GGUF is becoming a mainstream format for compressing large models and accelerating their inference. Transformers can already load T5 checkpoints in GGUF format, but inference with them is not accelerated.
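
For context, the current loading path looks roughly like the sketch below, assuming a transformers version with GGUF support for T5 and the `gguf` package installed; the repository id and filename are illustrative placeholders, not real artifacts.

```python
# Hedged sketch of today's behavior: GGUF weights can be loaded via `gguf_file`,
# but they are dequantized to a full-precision torch dtype during loading.
from transformers import T5EncoderModel

model = T5EncoderModel.from_pretrained(
    "some-org/t5-xxl-encoder-gguf",        # placeholder GGUF repository
    gguf_file="t5-xxl-encoder-Q8_0.gguf",  # placeholder quantized checkpoint
)
# Because the tensors are expanded on load, the memory and speed benefits of the
# quantized format are lost at inference time.
```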

Describe the solution you'd like.
It would be very helpful if models available in GGUF format (such as T5 and the Flux transformer component) could be loaded from GGUF files and then run inference directly in that quantized format, instead of being converted to float32 for inference.
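
As a rough illustration of the requested behavior (not an existing transformers or diffusers API), a layer could keep its quantized weights resident and dequantize them only for the duration of each forward pass, so peak memory stays close to the quantized footprint. The class name and the simple int8-plus-scale scheme below are illustrative, not GGUF's actual block formats.

```python
# Toy sketch of on-the-fly dequantization at forward time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OnTheFlyDequantLinear(nn.Module):
    def __init__(self, q_weight, scales, bias=None):
        super().__init__()
        self.register_buffer("q_weight", q_weight)  # int8, shape (out_features, in_features)
        self.register_buffer("scales", scales)      # float, shape (out_features, 1)
        self.register_buffer("bias", bias)          # optional float bias

    def forward(self, x):
        # Dequantize only for this matmul; the full-precision copy is temporary,
        # so the model's resident weights stay in the compact format.
        w = self.q_weight.to(x.dtype) * self.scales.to(x.dtype)
        return F.linear(x, w, self.bias)
```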

Describe alternatives you've considered.

Additional context.

@sayakpaul
Member

#9487 (comment)

Cc: @DN6

@DN6
Collaborator

DN6 commented Nov 14, 2024

Hi @chuck-ma. A PR for what you're describing is in the works. I will open it soon.
