
feat(vllm)!: upgrade vllm backend and refactor deployment #854

Merged
merged 350 commits into main from 835-upgrade-vllm-for-gptq-bfloat16-inferencing on Oct 3, 2024

Conversation

@justinthelaw (Contributor) commented Jul 30, 2024

OVERVIEW

See #835 for more details on rationale and findings. Also related to #623.

IMPORTANT NOTE: there are still ongoing AsyncLLMEngineDead and RoPE scaling + Ray issues upstream that may prevent us from upgrading past 0.4.x.

BREAKING CHANGES:

  • moves all ENV specific to the LeapfrogAI SDK into a ConfigMap, mounted via volumeMount for runtime injection and modification
    • in local dev, this is defined via config.yaml
  • moves all ENV specific to vLLM into a ConfigMap, injected via envFrom for runtime injection and modification
    • in local dev, this is defined via .env (a sketch of how the backend might consume both injection paths follows this list)
  • ZARF_CONFIG is used to define create-time and deploy-time variables (e.g., MODEL_REPO_ID, ENFORCE_EAGER)
    • updates Make targets and workflows with new ZARF_CONFIG variable
    • updates UDS bundles with new Zarf deployment variable overrides
    • allows delivery engineers to declaratively define the backend configs and model
  • re-introduces the LFAI SDK config.yaml configuration method for local development and testing
  • MUST upgrade the API and backends together due to the FinishReason proto change
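
To make the two injection paths above concrete, here is a minimal sketch (not the PR's actual code) of how a backend could pick them up at runtime. The mount path, environment variable names, and defaults are illustrative assumptions:

```python
# Minimal sketch, assuming a hypothetical mount path and env names:
# the LeapfrogAI SDK config arrives as a file (config.yaml mounted from a
# ConfigMap via volumeMount), while vLLM settings arrive as environment
# variables (ConfigMap injected via envFrom).
import os

import yaml  # PyYAML


def load_sdk_config(path: str = "/config/config.yaml") -> dict:
    """Read the SDK runtime config from the mounted ConfigMap (or a local config.yaml in dev)."""
    with open(path, "r") as f:
        return yaml.safe_load(f) or {}


def load_vllm_settings() -> dict:
    """Collect vLLM engine settings injected as environment variables (or a local .env in dev)."""
    return {
        "model": os.environ.get("MODEL_REPO_ID", "Synthia-7b"),  # assumed env name and default
        "enforce_eager": os.environ.get("ENFORCE_EAGER", "false").lower() == "true",  # assumed env name
    }
```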

CHANGES:

  • updates docs for running vLLM locally, in Docker, and in-cluster
  • updates to Python 3.11.9 to align with the Registry1 base image
  • upgrades vLLM from 0.4.2 to 0.4.3, which:
    • adds BFloat16 quantized model support (see the linked "why is this important" note)
  • exposes more backend configurations for Zarf build (see packages/vllm/README.md)
    • adds full set of QUANTIZATION options to existing configuration field
    • exposes everything via a Zarf variable and the values files
  • removes the default vLLM engine configurations from the Dockerfile (no duplication or hardcoding)
  • fixes an issue where request params (e.g., temperature) were not used when generating the response
    • uses the LFAI SDK-received request object for inferencing params
    • gracefully handles naming differences between vLLM and SDK params during generation (a sketch of this mapping follows this list)
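
As a rough illustration of the param-handling fix (not the exact implementation in this PR), the mapping from an SDK-received request onto vLLM generation params might look like the following; the field names on the incoming request object are assumptions, not the SDK's real schema:

```python
# Hedged sketch: translate SDK-style request fields into vLLM's SamplingParams
# so values such as temperature are no longer dropped during generation.
from vllm import SamplingParams


def to_sampling_params(request) -> SamplingParams:
    """Map SDK request params onto vLLM's names, tolerating missing fields."""
    return SamplingParams(
        temperature=getattr(request, "temperature", 1.0),
        top_p=getattr(request, "top_p", 1.0),
        # the SDK side may call this max_new_tokens (assumed name); vLLM calls it max_tokens
        max_tokens=getattr(request, "max_new_tokens", 256),
        stop=list(getattr(request, "stop", []) or []),
    )
```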

ADDITIONAL CONTEXT

The default model is still Synthia-7b until #976 is resolved. The description below is only being kept for future context:

The new model option, defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g, takes ~4.16GB to load into RAM or vRAM. At 6GB of vRAM, the max_context_length variable has to be reduced to ~400 tokens; at 8GB of vRAM, ~10K tokens; and at 12GB of vRAM, ~15K tokens. In all cases, the max_gpu_utilization variable must be set to 0.99 in order to max out the KV cache size for the context length that is reserved at inference.

To achieve the model's maximum context length (~32K tokens), 16GB of vRAM is required.
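
For reference, those knobs presumably correspond to vLLM's own engine arguments (max_context_length → max_model_len, max_gpu_utilization → gpu_memory_utilization); that mapping is an assumption, not something defined in this PR. A rough offline sketch of the ~8GB-of-vRAM case:

```python
# Rough sketch (not part of this PR) of the ~8GB-of-vRAM numbers above,
# expressed as vLLM engine arguments.
from vllm import LLM, SamplingParams

llm = LLM(
    model="defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g",
    quantization="gptq",          # 4-bit, group-size-32 GPTQ weights
    max_model_len=10_000,         # ~10K-token context fits in ~8GB of vRAM
    gpu_memory_utilization=0.99,  # max out the KV cache for the reserved context
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```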

Reaching the defined max_context_length during a completion or chat will trigger vLLM's automatic sliding window handler, which significantly degrades the quality of the final responses.

@justinthelaw justinthelaw added the enhancement (New feature or request) and dependencies (Pull requests that update a dependency file) labels Jul 30, 2024
@justinthelaw justinthelaw requested a review from a team as a code owner July 30, 2024 16:01
@justinthelaw justinthelaw marked this pull request as draft July 30, 2024 16:48
@justinthelaw justinthelaw marked this pull request as ready for review July 30, 2024 20:31
@justinthelaw justinthelaw changed the title from "feat: upgrade vLLM dep and image, expose backend params" to "feat: upgrade vLLM image and expose more backend params" Jul 30, 2024
@justinthelaw justinthelaw changed the title from "feat: upgrade vLLM image and expose more backend params" to "feat: enhance vLLM backend and expose more params" Jul 30, 2024
@justinthelaw (Contributor, Author) commented Jul 30, 2024

For PR review, I recommend deploying and testing different build arguments (see the Dockerfile), including swapping out the model and the default backend configurations, to make sure things work with the new vLLM engine.

@justinthelaw (Contributor, Author) commented Jul 30, 2024

See this issue on CPU offloading: vllm-project/vllm#6952

The offloading works, but actual inferencing runs into issues when certain parts of the code look for a model weight that isn't where they expect it, specifically for Phi-3. That is why I haven't added this feature/parameter to this PR.

@justinthelaw justinthelaw marked this pull request as draft July 31, 2024 20:49
@justinthelaw (Contributor, Author) commented Jul 31, 2024

Converting back to draft: there is some odd engine behavior in the interaction with quantized Phi-3 RoPE scaling that is crashing the engine. Investigating...

@justinthelaw (Contributor, Author) commented Jul 31, 2024

[Screenshot: 2024-07-31 165127]

The LLM responds correctly; however, the engine runs into some blocking process that causes a timeout on subsequent responses once the conversation reaches 2-5 messages in length.

See this issue upstream: vllm-project/vllm#5901

@justinthelaw justinthelaw marked this pull request as ready for review August 1, 2024 22:39
@justinthelaw (Contributor, Author) commented Aug 2, 2024

https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard - for checking Hermes-2-Pro-Mistral-7B's current standing against other open-source LLMs.

@justinthelaw justinthelaw force-pushed the 835-upgrade-vllm-for-gptq-bfloat16-inferencing branch from 0acedbe to 416909c Compare August 2, 2024 21:19
@justinthelaw justinthelaw marked this pull request as draft August 5, 2024 17:42
@justinthelaw justinthelaw marked this pull request as ready for review August 19, 2024 19:23
@justinthelaw justinthelaw self-assigned this Aug 19, 2024
@justinthelaw justinthelaw changed the title from "feat: enhance vLLM backend and expose more params" to "feat: upgrade vllm backend and expose more params" Aug 20, 2024
@justinthelaw justinthelaw added and then removed the dependencies (Pull requests that update a dependency file) label Aug 20, 2024
@justinthelaw justinthelaw requested a review from a team August 29, 2024 19:47
@justinthelaw justinthelaw force-pushed the 835-upgrade-vllm-for-gptq-bfloat16-inferencing branch from 84f16e0 to 2ac0cf8 Compare August 30, 2024 16:29
Review thread on tests/utils/client.py (outdated, resolved)
@gphorvath gphorvath previously approved these changes Oct 2, 2024
Review thread on packages/vllm/README.md (outdated, resolved)
@justinthelaw justinthelaw requested review from gphorvath, jalling97 and a team October 3, 2024 13:53
@jalling97 (Contributor) left a comment


Glad to see the process for updating models is simplified (both locally and in deployments) and the docs are clear as to how it works! I can confirm that vLLM builds and deploys to a local cluster without issue. It's a shame about the issues with newer versions of vLLM; there are some features that would be nice to utilize. Great stuff!

@justinthelaw justinthelaw merged commit fd3cbc4 into main Oct 3, 2024
35 checks passed
@justinthelaw justinthelaw deleted the 835-upgrade-vllm-for-gptq-bfloat16-inferencing branch October 3, 2024 16:07