feat(vllm)!: upgrade vllm backend and refactor deployment #854
Conversation
✅ Deploy Preview for leapfrogai-docs canceled.
For PR review, I recommend deploying and testing out different build arguments (see the Dockerfile), including swapping out the model and the default backend configurations, to make sure things work with the new vLLM engine.
See this issue on CPU offloading: vllm-project/vllm#6952. The offloading works, but the actual inferencing runs into issues when certain parts of the code look for a model weight that isn't where they expect it - specifically for Phi-3. That is why I haven't added this feature/parameter to this PR.
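For context, a minimal sketch of what enabling that feature would look like if it were added, assuming a vLLM release that exposes the `cpu_offload_gb` engine argument; the model ID and offload size below are illustrative only, not values from this PR:

```python
# Sketch of vLLM CPU offloading (the feature deliberately left out of this PR).
# Assumes a vLLM version that supports the cpu_offload_gb engine argument.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-mini-4k-instruct",  # illustrative; Phi-3 is the model that exposed the bug
    cpu_offload_gb=4,  # offload roughly 4 GiB of weights to CPU RAM
    enforce_eager=True,
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```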
Converting back to draft - seeing some odd engine behavior when interacting with quantized Phi-3 RoPE scaling, which crashes the engine. Investigating...
The LLM responds correctly; however, the engine runs into some blocking process that causes a timeout on subsequent responses once the conversation reaches 2-5 messages in length. See this issue upstream: vllm-project/vllm#5901
https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard - for checking Hermes-2-Pro-Mistral-7b's current standings against other open source LLMs. |
Glad to see the process for updating models is simplified (both locally and in deployments) and the docs are clear as to how it works! Can confirm that vllm builds and deploys to a local cluster without issue. It's a shame about the issues with newer versions of vLLM - there are some features that would be nice to utilize. Great stuff!
OVERVIEW
See #835 for more details on rationale and findings. Also related to #623.
IMPORTANT NOTE: there are still ongoing AsyncLLMEngineDead and RoPE scaling + Ray issues upstream that may prevent us from upgrading past 0.4.x.

BREAKING CHANGES:
- New `volumeMount` for runtime injection and modification of the `config.yaml` (see the sketch after this list)
- New `envFrom` for runtime injection and modification of the `.env`
- `ZARF_CONFIG` is used to define create-time and deploy-time variables (e.g., `MODEL_REPO_ID`, `ENFORCE_EAGER`)
- New `ZARF_CONFIG` variable and `config.yaml` configuration method for local development and testing
- `FinishReason` proto change
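To illustrate the injection flow described above, here is a minimal sketch of how a backend could pick up the volume-mounted `config.yaml` and the `envFrom`-sourced variables at startup. The mount path `/config/config.yaml`, the `load_backend_config` helper, and the merge behavior are assumptions for illustration, not the actual implementation:

```python
# Minimal sketch (not the actual LeapfrogAI implementation): read the
# volumeMount-injected config.yaml and the envFrom-injected variables at
# startup so the backend can be reconfigured without rebuilding the image.
import os
from pathlib import Path

import yaml  # PyYAML


def load_backend_config(config_path: str = "/config/config.yaml") -> dict:
    """Merge the mounted config.yaml with environment-variable overrides."""
    config: dict = {}

    # 1. config.yaml arrives via a Kubernetes volumeMount (path is an assumption).
    path = Path(config_path)
    if path.exists():
        config = yaml.safe_load(path.read_text()) or {}

    # 2. .env values arrive as environment variables via envFrom;
    #    MODEL_REPO_ID and ENFORCE_EAGER are the variables named in this PR.
    if "MODEL_REPO_ID" in os.environ:
        config["model_repo_id"] = os.environ["MODEL_REPO_ID"]
    if "ENFORCE_EAGER" in os.environ:
        config["enforce_eager"] = os.environ["ENFORCE_EAGER"].lower() == "true"

    return config


if __name__ == "__main__":
    print(load_backend_config())
```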
CHANGES:
- Documentation updates (see `packages/vllm/README.md`)
- Added `QUANTIZATION` options to the existing configuration field
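As an illustration of where a `QUANTIZATION` option would end up, the sketch below (assumed variable handling, not the project's actual code) passes it through to the vLLM engine arguments alongside the other deploy-time variables:

```python
# Sketch only: map injected QUANTIZATION / MODEL_REPO_ID / ENFORCE_EAGER
# environment variables onto vLLM engine arguments.
import os

from vllm import LLM

quantization = os.environ.get("QUANTIZATION") or None  # e.g. "gptq" or "awq"

llm = LLM(
    model=os.environ["MODEL_REPO_ID"],  # set via ZARF_CONFIG / .env
    quantization=quantization,  # None lets vLLM detect it from the model's config
    enforce_eager=os.environ.get("ENFORCE_EAGER", "false").lower() == "true",
)
```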
ADDITIONAL CONTEXT

The default model is still Synthia-7b until #976 is resolved. The description below is only being kept for future context: