feat(vllm): upgrade vllm and expose more params for bfloat16 quant compatibility #835
Labels: dependencies, tech-debt
Describe what should be investigated or refactored
vLLM is currently not compatible with all GPTQ BFLOAT16 quantized models due to the pinned dependency version (0.4.2). The dependency needs to be upgraded to the next patch version (0.4.3), or upgraded outright to a later minor version (0.5.2).
The following test model should work once this issue is fixed (it fits on an RTX 4060 - 4090): https://huggingface.co/TheBloke/phi-2-orange-GPTQ/blob/main/config.json
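As a quick check once the dependency is bumped, a load like the sketch below should succeed for the test model. The sampling settings and GPU memory fraction are illustrative assumptions, not values taken from this issue.

```python
from vllm import LLM, SamplingParams

# Sketch of a verification run: load the GPTQ test model with bfloat16
# activations. On the currently pinned vllm 0.4.2 this is expected to fail
# with a GPTQ/bfloat16 compatibility error; after the upgrade it should load.
llm = LLM(
    model="TheBloke/phi-2-orange-GPTQ",
    quantization="gptq",
    dtype="bfloat16",
    gpu_memory_utilization=0.90,  # assumption: enough headroom on an RTX 4060-4090
)

outputs = llm.generate(
    ["Summarize GPTQ quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```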
Links to any relevant code
Example model that wouldn't work, but should: https://huggingface.co/TheBloke/phi-2-orange-GPTQ/blob/main/config.json
Issue related to the vLLM GPTQ BFLOAT16 PR: vllm-project/vllm#2149
Additional context
This issue was confirmed when deploying Nous-Hermes-2-8x7b-DPO-GPTQ (8-bit, 128g group size, Act Order) to an H100 GPU. Changing the dtype in config.json to float16, despite the loss of precision, allows the model to be served by vLLM.
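For reference, the same workaround can be applied without editing the checkpoint by overriding the dtype at load time. The model identifier below is a placeholder for the deployed Nous-Hermes checkpoint, and the parallelism setting is an assumption.

```python
from vllm import LLM

# Workaround sketch: rather than editing config.json on disk, ask vLLM to run
# the GPTQ checkpoint with float16 activations, accepting the precision
# trade-off noted above.
llm = LLM(
    model="Nous-Hermes-2-8x7b-DPO-GPTQ",  # placeholder: substitute the actual repo id or local path
    quantization="gptq",
    dtype="float16",
    tensor_parallel_size=1,  # assumption: single H100
)
```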