
feat(vllm)!: upgrade vllm backend and refactor deployment #854

Merged
merged 350 commits into main from 835-upgrade-vllm-for-gptq-bfloat16-inferencing on Oct 3, 2024

Conversation

@justinthelaw (Contributor) commented Jul 30, 2024

OVERVIEW

See #835 for more details on rationale and findings. Also related to #623.

IMPORTANT NOTE: there are still ongoing AsyncLLMEngineDead and RoPE scaling + Ray issues upstream that may prevent us from upgrading past 0.4.x.

BREAKING CHANGES:

  • moves all ENV specific to the LeapfrogAI SDK into a ConfigMap, mounted via volumeMount for runtime injection and modification
    • in local dev, this is defined via config.yaml
  • moves all ENV specific to vLLM into a ConfigMap, injected via envFrom for runtime injection and modification
    • in local dev, this is defined via .env (a sketch of how the backend might consume both injection paths follows this list)
  • ZARF_CONFIG is used to define create-time and deploy-time variables (e.g., MODEL_REPO_ID, ENFORCE_EAGER)
    • updates Make targets and workflows with new ZARF_CONFIG variable
    • updates UDS bundles with new Zarf deployment variable overrides
    • allows delivery engineers to declaratively define the backend configs and model
  • re-introduces the LFAI SDK config.yaml configuration method for local development and testing
  • MUST upgrade the API and backends together due to the FinishReason proto change
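
To make the two injection paths above concrete, here is a minimal sketch (not the PR's actual code) of how a backend could pick them up at runtime. The mount path, environment variable names, and defaults are illustrative assumptions:

```python
# Minimal sketch, assuming a hypothetical mount path and env names:
# the LeapfrogAI SDK config arrives as a file (config.yaml mounted from a
# ConfigMap via volumeMount), while vLLM settings arrive as environment
# variables (ConfigMap injected via envFrom).
import os

import yaml  # PyYAML


def load_sdk_config(path: str = "/config/config.yaml") -> dict:
    """Read the SDK runtime config from the mounted ConfigMap (or a local config.yaml in dev)."""
    with open(path, "r") as f:
        return yaml.safe_load(f) or {}


def load_vllm_settings() -> dict:
    """Collect vLLM engine settings injected as environment variables (or a local .env in dev)."""
    return {
        "model": os.environ.get("MODEL_REPO_ID", "Synthia-7b"),  # assumed env name and default
        "enforce_eager": os.environ.get("ENFORCE_EAGER", "false").lower() == "true",  # assumed env name
    }
```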

CHANGES:

  • updates docs for running vLLM locally, in Docker, and in-cluster
  • updates to Python 3.11.9 to align with the Registry1 base image
  • upgrades vLLM from 0.4.2 to 0.4.3, which:
    • adds BFloat16 quantized model support (see the linked "why is this important" note)
  • exposes more backend configurations for Zarf build (see packages/vllm/README.md)
    • adds full set of QUANTIZATION options to existing configuration field
    • exposes everything via a Zarf variable and the values files
  • removes the default vLLM engine configurations from the Dockerfile (no duplication or hardcoding)
  • fixes an issue where request params (e.g., temperature) were not used when generating the response
    • uses the LFAI SDK-received request object for inferencing params
    • gracefully handles naming differences between vLLM and SDK params during generation (a sketch of this mapping follows this list)
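
As a rough illustration of the param-handling fix (not the exact implementation in this PR), the mapping from an SDK-received request onto vLLM generation params might look like the following; the field names on the incoming request object are assumptions, not the SDK's real schema:

```python
# Hedged sketch: translate SDK-style request fields into vLLM's SamplingParams
# so values such as temperature are no longer dropped during generation.
from vllm import SamplingParams


def to_sampling_params(request) -> SamplingParams:
    """Map SDK request params onto vLLM's names, tolerating missing fields."""
    return SamplingParams(
        temperature=getattr(request, "temperature", 1.0),
        top_p=getattr(request, "top_p", 1.0),
        # the SDK side may call this max_new_tokens (assumed name); vLLM calls it max_tokens
        max_tokens=getattr(request, "max_new_tokens", 256),
        stop=list(getattr(request, "stop", []) or []),
    )
```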

ADDITIONAL CONTEXT

The default model is still Synthia-7b until #976 is resolved. The description below is only being kept for future context:

The new model option, defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g, takes ~4.16GB to load into RAM or vRAM. At 6GB of vRAM, the max_context_length variable has to be reduced to ~400 tokens; at 8GB of vRAM, ~10K tokens; and at 12GB of vRAM, ~15K tokens. In all cases, the max_gpu_utilization variable must be set to 0.99 in order to max out the KV cache size for the context length that is reserved at inference.

To achieve the model's maximum context length (~32K tokens), 16GB of vRAM is required.
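
For reference, those knobs presumably correspond to vLLM's own engine arguments (max_context_length → max_model_len, max_gpu_utilization → gpu_memory_utilization); that mapping is an assumption, not something defined in this PR. A rough offline sketch of the ~8GB-of-vRAM case:

```python
# Rough sketch (not part of this PR) of the ~8GB-of-vRAM numbers above,
# expressed as vLLM engine arguments.
from vllm import LLM, SamplingParams

llm = LLM(
    model="defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g",
    quantization="gptq",          # 4-bit, group-size-32 GPTQ weights
    max_model_len=10_000,         # ~10K-token context fits in ~8GB of vRAM
    gpu_memory_utilization=0.99,  # max out the KV cache for the reserved context
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```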

Reaching the defined max_context_length during a completion or chat will trigger vLLM's automatic sliding window handler, which significantly degrades the quality of the final responses.

@justinthelaw justinthelaw added the enhancement (New feature or request) and dependencies (Pull requests that update a dependency file) labels Jul 30, 2024
@justinthelaw justinthelaw requested a review from a team as a code owner July 30, 2024 16:01
@justinthelaw justinthelaw marked this pull request as draft July 30, 2024 16:48
@justinthelaw justinthelaw marked this pull request as ready for review July 30, 2024 20:31
@justinthelaw justinthelaw changed the title from "feat: upgrade vLLM dep and image, expose backend params" to "feat: upgrade vLLM image and expose more backend params" Jul 30, 2024
@justinthelaw justinthelaw changed the title from "feat: upgrade vLLM image and expose more backend params" to "feat: enhance vLLM backend and expose more params" Jul 30, 2024
@justinthelaw (Contributor, Author) commented Jul 30, 2024

For PR review, I recommend deploying and testing different build arguments (see the Dockerfile), including swapping out the model and the default backend configurations, to make sure things work with the new vLLM engine.

@justinthelaw (Contributor, Author) commented Jul 30, 2024

See this issue on CPU offloading: vllm-project/vllm#6952

The offloading works, but actual inferencing runs into issues when certain parts of the code look for a model weight that isn't where they expect it, specifically for Phi-3. That is why I haven't added this feature/parameter to this PR.

@justinthelaw justinthelaw marked this pull request as draft July 31, 2024 20:49
@justinthelaw (Contributor, Author) commented Jul 31, 2024

Converting back to draft: there is some odd engine behavior in the interaction with quantized Phi-3 RoPE scaling that is crashing the engine. Investigating...

@justinthelaw (Contributor, Author) commented Jul 31, 2024

[Screenshot: 2024-07-31 165127]

The LLM responds correctly; however, the engine runs into some blocking process that causes a timeout on subsequent responses once the conversation reaches 2-5 messages in length.

See this issue upstream: vllm-project/vllm#5901

@justinthelaw justinthelaw marked this pull request as ready for review August 1, 2024 22:39
@justinthelaw (Contributor, Author) commented Aug 2, 2024

https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard - for checking Hermes-2-Pro-Mistral-7B's current standing against other open-source LLMs.

@justinthelaw justinthelaw force-pushed the 835-upgrade-vllm-for-gptq-bfloat16-inferencing branch from 0acedbe to 416909c Compare August 2, 2024 21:19
@justinthelaw justinthelaw marked this pull request as draft August 5, 2024 17:42
@justinthelaw justinthelaw marked this pull request as ready for review August 19, 2024 19:23
@justinthelaw justinthelaw self-assigned this Aug 19, 2024
@justinthelaw justinthelaw changed the title from "feat: enhance vLLM backend and expose more params" to "feat: upgrade vllm backend and expose more params" Aug 20, 2024
@justinthelaw justinthelaw added and then removed the dependencies (Pull requests that update a dependency file) label Aug 20, 2024
@justinthelaw justinthelaw requested a review from a team August 29, 2024 19:47
@justinthelaw justinthelaw force-pushed the 835-upgrade-vllm-for-gptq-bfloat16-inferencing branch from 84f16e0 to 2ac0cf8 Compare August 30, 2024 16:29
Review thread on tests/utils/client.py (outdated, resolved)
@gphorvath gphorvath previously approved these changes Oct 2, 2024
Review thread on packages/vllm/README.md (outdated, resolved)
@justinthelaw justinthelaw requested review from gphorvath, jalling97 and a team October 3, 2024 13:53
@jalling97 (Contributor) left a comment


Glad to see the process for updating models is simplified (both locally and in deployments) and the docs are clear as to how it works! I can confirm that vLLM builds and deploys to a local cluster without issue. It's a shame about the issues with newer versions of vLLM; there are some features that would be nice to utilize. Great stuff!

@justinthelaw justinthelaw merged commit fd3cbc4 into main Oct 3, 2024
35 checks passed
@justinthelaw justinthelaw deleted the 835-upgrade-vllm-for-gptq-bfloat16-inferencing branch October 3, 2024 16:07