feat(vllm): upgrade vllm and expose more params for bfloat16 quant compatibility #835
Labels: dependencies, tech-debt
Describe what should be investigated or refactored
vLLM is currently not compatible with all GPTQ BFLOAT16 quantized models due to the pinned dependency version (0.4.2). The dependency needs to be upgraded to the next patch version (0.4.3), or upgraded outright to a later minor version (0.5.2).
The following test model should work once this issue is fixed (it fits on an RTX 4060 - 4090): https://huggingface.co/TheBloke/phi-2-orange-GPTQ/blob/main/config.json
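As a quick check once the dependency is bumped, a load like the sketch below should succeed for the test model. The sampling settings and GPU memory fraction are illustrative assumptions, not values taken from this issue.

```python
from vllm import LLM, SamplingParams

# Sketch of a verification run: load the GPTQ test model with bfloat16
# activations. On the currently pinned vllm 0.4.2 this is expected to fail
# with a GPTQ/bfloat16 compatibility error; after the upgrade it should load.
llm = LLM(
    model="TheBloke/phi-2-orange-GPTQ",
    quantization="gptq",
    dtype="bfloat16",
    gpu_memory_utilization=0.90,  # assumption: enough headroom on an RTX 4060-4090
)

outputs = llm.generate(
    ["Summarize GPTQ quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```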
Links to any relevant code
Example model that wouldn't work, but should: https://huggingface.co/TheBloke/phi-2-orange-GPTQ/blob/main/config.json
Issue related to the vLLM GPTQ BFLOAT16 PR: vllm-project/vllm#2149
Additional context
This issue was confirmed when deploying Nous-Hermes-2-8x7b-DPO-GPTQ (8-bit, 128g group size, Act Order) to an H100 GPU. Changing the dtype in config.json to float16, despite the loss of precision, allows the model to be served by vLLM.
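For reference, the same workaround can be applied without editing the checkpoint by overriding the dtype at load time. The model identifier below is a placeholder for the deployed Nous-Hermes checkpoint, and the parallelism setting is an assumption.

```python
from vllm import LLM

# Workaround sketch: rather than editing config.json on disk, ask vLLM to run
# the GPTQ checkpoint with float16 activations, accepting the precision
# trade-off noted above.
llm = LLM(
    model="Nous-Hermes-2-8x7b-DPO-GPTQ",  # placeholder: substitute the actual repo id or local path
    quantization="gptq",
    dtype="float16",
    tensor_parallel_size=1,  # assumption: single H100
)
```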