This repository has been archived by the owner on Feb 15, 2025. It is now read-only.

feat(vllm)!: upgrade vllm backend and refactor deployment #854

Merged
350 commits merged on Oct 3, 2024
Changes from 49 commits

Commits (350)
301fd42
fix Dockerfile lint
justinthelaw Sep 16, 2024
2a2c7d6
re-added default tensor size
justinthelaw Sep 16, 2024
98227a6
fix README
justinthelaw Sep 16, 2024
c620efa
cleanup
justinthelaw Sep 16, 2024
6593fbb
3.11.9 python
justinthelaw Sep 16, 2024
79272d1
fix FinishReason, add vLLM E2E
justinthelaw Sep 17, 2024
927ad25
llama completion test, add CompleteStreamChoice
justinthelaw Sep 17, 2024
e9e434f
condense e2e to 1 file, add max_new_tokens
justinthelaw Sep 17, 2024
d8c6767
formatting fix
justinthelaw Sep 17, 2024
29a9785
max_tokens for OpenAI client
justinthelaw Sep 17, 2024
a166c93
fix singular model_name arg
justinthelaw Sep 17, 2024
1c63741
isolate model_name to single test
justinthelaw Sep 17, 2024
2e82a9f
fix e2e-llama-cpp-python.yaml
justinthelaw Sep 17, 2024
807128e
Update e2e-vllm.yaml
justinthelaw Sep 17, 2024
e48331f
model_name fixture
justinthelaw Sep 17, 2024
e88b29f
Merge remote-tracking branch 'origin/main' into 1037-testvllm-impleme…
justinthelaw Sep 17, 2024
b366c5f
Merge remote-tracking branch 'origin/main' into 835-upgrade-vllm-for-…
justinthelaw Sep 17, 2024
ecbd4f7
handle request queue possibly being None
justinthelaw Sep 17, 2024
8552ce0
workaround GPU runner issue
justinthelaw Sep 17, 2024
af4e4ca
workaround GPU runner issue, pt.2
justinthelaw Sep 17, 2024
5b1532a
workaround GPU runner issue, pt.3
justinthelaw Sep 17, 2024
a8551e5
workaround GPU runner issue, pt.4
justinthelaw Sep 17, 2024
5f1b3c1
temp turn on e2e vllm, add nvidia-smi
justinthelaw Sep 17, 2024
1e7e98c
add nvidia setp
justinthelaw Sep 17, 2024
c46731a
fix cluster cmd, play with prompt
justinthelaw Sep 17, 2024
161fb3a
k3d permissions
justinthelaw Sep 17, 2024
84a0388
Update e2e-vllm.yaml
justinthelaw Sep 17, 2024
cb905ff
Update e2e-llama-cpp-python.yaml
justinthelaw Sep 17, 2024
6afb992
e2e-vllm.yaml with lfai-core
justinthelaw Sep 17, 2024
094da70
vllm e2e missing cluster create
justinthelaw Sep 17, 2024
f5d9f82
fix llama e2e steps
justinthelaw Sep 17, 2024
9fb28fa
test GPU cluster health
justinthelaw Sep 17, 2024
c19cec2
test GPU runner deps, pt.1
justinthelaw Sep 17, 2024
8767649
test GPU runner deps, pt.2
justinthelaw Sep 17, 2024
52857c5
test GPU runner deps, pt.3
justinthelaw Sep 17, 2024
287b911
test GPU runner deps, pt.4
justinthelaw Sep 17, 2024
e0b7e18
test GPU runner deps, pt.5
justinthelaw Sep 17, 2024
042248d
add comments
justinthelaw Sep 17, 2024
64079aa
better comments, log test outputs
justinthelaw Sep 17, 2024
0148b92
add wait-for, more comments
justinthelaw Sep 17, 2024
04ab8b2
remove formatting
justinthelaw Sep 17, 2024
635bdaf
fix CUDA pod test
justinthelaw Sep 17, 2024
c85a00c
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 17, 2024
4b73ba9
Merge remote-tracking branch 'origin/main' into 1037-testvllm-impleme…
justinthelaw Sep 17, 2024
f7b2a50
reduced context window
justinthelaw Sep 17, 2024
1bef345
remove pytest cache Make target
justinthelaw Sep 17, 2024
8b2af46
vLLM deployment debugging
justinthelaw Sep 17, 2024
9dae852
revert formatting
justinthelaw Sep 17, 2024
d44a907
fix build, add better debugging steps
justinthelaw Sep 17, 2024
5af2d70
fix Kubectl commands
justinthelaw Sep 17, 2024
8befd3b
nvidia daemonset debug
justinthelaw Sep 17, 2024
32a1c31
set nvidia runtime as default
justinthelaw Sep 17, 2024
1e7aca1
check node issues
justinthelaw Sep 17, 2024
2464cc4
draft, node detailed describe
justinthelaw Sep 17, 2024
c7b4aa3
Update cuda-vector-add.yaml
justinthelaw Sep 17, 2024
2245c7c
Update cuda-vector-add.yaml
justinthelaw Sep 17, 2024
b1933c2
more cluster runner debugging
justinthelaw Sep 17, 2024
325f520
remove erroneous journal to command
justinthelaw Sep 18, 2024
87cc755
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 18, 2024
32ed63c
Merge remote-tracking branch 'origin/main' into 1037-testvllm-impleme…
justinthelaw Sep 18, 2024
5c13861
docker-level debug addition
justinthelaw Sep 18, 2024
e4e4611
downgrade CUDA version
justinthelaw Sep 18, 2024
32dad39
downgrade CUDA version, again
justinthelaw Sep 18, 2024
850100f
try root full
justinthelaw Sep 18, 2024
d1d6e48
try root, pt.2
justinthelaw Sep 18, 2024
df61e46
try root, pt.3
justinthelaw Sep 18, 2024
34926e9
different tests and logs
justinthelaw Sep 18, 2024
547a64b
typo
justinthelaw Sep 18, 2024
59ce6f6
revert to old daemonset version
justinthelaw Sep 18, 2024
284812d
typo
justinthelaw Sep 18, 2024
b222543
add config.toml to k3s image
justinthelaw Sep 18, 2024
76cccbc
get failure reason
justinthelaw Sep 18, 2024
d6aacf0
Merge branch 'main' into 1037-testvllm-implement-e2e-testing-for-vllm
justinthelaw Sep 18, 2024
c9e7840
just see if change in containerd config works
justinthelaw Sep 18, 2024
1514ead
Dockerfile changes, apply both tests
justinthelaw Sep 18, 2024
a437b7b
typo
justinthelaw Sep 18, 2024
66ef462
fix image tag, add NVIDIA capabilities all
justinthelaw Sep 18, 2024
c9d480c
align docker test, add node label
justinthelaw Sep 18, 2024
a32226a
add quotes, increase priv
justinthelaw Sep 18, 2024
db04bd0
Merge remote-tracking branch 'origin/main' into 1037-testvllm-impleme…
justinthelaw Sep 18, 2024
d199203
add nfd
justinthelaw Sep 18, 2024
bd89870
add nfd, pt.1
justinthelaw Sep 18, 2024
2ce805b
remove nfd
justinthelaw Sep 18, 2024
3be3648
remove set-as-default
justinthelaw Sep 18, 2024
9cf7d7f
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 18, 2024
5f777e0
Merge branch 'main' into 1037-testvllm-implement-e2e-testing-for-vllm
justinthelaw Sep 18, 2024
8d28084
refactor, unload drivers
justinthelaw Sep 18, 2024
c12aa82
script typo
justinthelaw Sep 18, 2024
6900dac
fix typos
justinthelaw Sep 18, 2024
3ab9228
slim k3d cluster, permission workaround
justinthelaw Sep 18, 2024
7dd8abf
k3d bootstrap match
justinthelaw Sep 18, 2024
79f8d30
k3d server name
justinthelaw Sep 18, 2024
2811359
nvidia wait-for
justinthelaw Sep 18, 2024
3cf42eb
remove extra stuff
justinthelaw Sep 18, 2024
331584e
pods out first
justinthelaw Sep 18, 2024
e7fdf7c
node out first, whoami
justinthelaw Sep 18, 2024
0662106
which k3d
justinthelaw Sep 18, 2024
6b04c55
sleep!
justinthelaw Sep 18, 2024
6110ec4
root user
justinthelaw Sep 18, 2024
9f9157c
root user, pt.2
justinthelaw Sep 18, 2024
664709b
revert vllm e2e GPU runner changes
justinthelaw Sep 18, 2024
f896e59
revert formatting changes
justinthelaw Sep 18, 2024
ef75a70
e2e tests made easier
justinthelaw Sep 18, 2024
2fcac88
Merge branch 'main' into 1037-testvllm-implement-e2e-testing-for-vllm
justinthelaw Sep 18, 2024
23c008e
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 18, 2024
d1d6540
e2e test Make target typo
justinthelaw Sep 18, 2024
2cfd164
Merge branch '1037-testvllm-implement-e2e-testing-for-vllm' of https:…
justinthelaw Sep 18, 2024
09510b7
zarf-config.yaml changes docs
justinthelaw Sep 18, 2024
1e89fac
add load_format
justinthelaw Sep 18, 2024
0568232
revert format e2e-llama-cpp-python.yaml
justinthelaw Sep 18, 2024
cc7ac6c
fixed Makefile typo
justinthelaw Sep 18, 2024
8a07080
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 18, 2024
f335be7
attempt merge with main
justinthelaw Sep 18, 2024
e0c0ac7
better clean-up
justinthelaw Sep 19, 2024
c90d820
add FinishReason enum back in
justinthelaw Sep 19, 2024
a1a03c1
passing unit tests
justinthelaw Sep 19, 2024
3da388f
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 19, 2024
3387974
Merge branch 'main' into 1037-testvllm-implement-e2e-testing-for-vllm
justinthelaw Sep 19, 2024
620e3b5
fixes GPU_LIMIT
justinthelaw Sep 20, 2024
09dd182
Merge remote-tracking branch 'origin/1037-testvllm-implement-e2e-test…
justinthelaw Sep 20, 2024
331a346
fixes load_format
justinthelaw Sep 20, 2024
6df5ebb
Merge branch 'main' into 1037-testvllm-implement-e2e-testing-for-vllm
justinthelaw Sep 20, 2024
304f659
Merge remote-tracking branch 'origin/1037-testvllm-implement-e2e-test…
justinthelaw Sep 20, 2024
cc46716
adds Docker container-only things
justinthelaw Sep 20, 2024
5ab0b99
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 20, 2024
da1399b
PR review fixes
justinthelaw Sep 20, 2024
59e1830
Merge remote-tracking branch 'origin/1037-testvllm-implement-e2e-test…
justinthelaw Sep 20, 2024
e963293
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 20, 2024
b9545b7
description for PROMPT_FORMAT*
justinthelaw Sep 20, 2024
5a6d59f
makefile clean improvements, add bundle configs
justinthelaw Sep 20, 2024
396370a
variabilize PYTHON_VERSION in vllm Dockerfile
justinthelaw Sep 20, 2024
b023dfa
missing download sub-cmd
justinthelaw Sep 20, 2024
f24180d
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 20, 2024
8ba6dcb
variabilize vllm directory
justinthelaw Sep 20, 2024
0186ad0
Merge branch '835-upgrade-vllm-for-gptq-bfloat16-inferencing' of http…
justinthelaw Sep 20, 2024
9791cb6
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 20, 2024
6e1ca0c
fix release.yaml
justinthelaw Sep 20, 2024
858b64f
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 20, 2024
ced0797
Update e2e-registry1-weekly.yaml
justinthelaw Sep 20, 2024
89d0d69
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 20, 2024
dd9b1bc
Update e2e-registry1-weekly.yaml
justinthelaw Sep 20, 2024
2bf474c
Update e2e-registry1-weekly.yaml
justinthelaw Sep 20, 2024
6effe8c
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 23, 2024
1641379
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 23, 2024
ca0ff03
update to 0.13.0, fix versioning
justinthelaw Sep 23, 2024
d365660
fix registry1 workflow, add prints
justinthelaw Sep 23, 2024
bdda602
merge with registry1 workflow
justinthelaw Sep 23, 2024
2e24a6b
chainguard login, fix registry1 uds setup
justinthelaw Sep 23, 2024
280927a
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Sep 23, 2024
bd3d7ff
Merge branch 'main' into chore-update-registry1-weekly-bundle-0.13.0
justinthelaw Sep 23, 2024
686e755
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 23, 2024
e109740
fix permissions
justinthelaw Sep 23, 2024
b4b767e
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Sep 23, 2024
7948e33
fix permissions, pt.2
justinthelaw Sep 23, 2024
c468b2c
fix permissions
justinthelaw Sep 23, 2024
44d4a0e
centralize integration llm config, no-cache-dir
justinthelaw Sep 23, 2024
9c1811c
merge with testing branch, pt.1
justinthelaw Sep 23, 2024
c0af7c7
centralize integration llm config, pt.2
justinthelaw Sep 23, 2024
6c24d34
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Sep 23, 2024
332d348
better make clean-all
justinthelaw Sep 23, 2024
14ab833
complete overhaul of registry1 weekly
justinthelaw Sep 23, 2024
3caed3a
revert formatting
justinthelaw Sep 23, 2024
c50e16a
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Sep 23, 2024
cbbdc20
update yq command for zarf.yaml
justinthelaw Sep 23, 2024
8518d71
yq sub typo
justinthelaw Sep 23, 2024
f11dd73
go back to using latest bundle
justinthelaw Sep 23, 2024
4079620
package create modifications
justinthelaw Sep 23, 2024
dd52e03
typo UDS zarf package create
justinthelaw Sep 23, 2024
a4fb386
correct bundle pointers and mutation
justinthelaw Sep 23, 2024
7192692
different zarf package ref location
justinthelaw Sep 23, 2024
d465753
log level debug
justinthelaw Sep 23, 2024
58b67c6
confirm missing C lib, more dynamic API create
justinthelaw Sep 24, 2024
25a1223
README improvement
justinthelaw Sep 24, 2024
9185ebf
README improvement, pt.2
justinthelaw Sep 24, 2024
5ff7f1c
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Sep 24, 2024
ee08217
0.13.0, merge with test branch
justinthelaw Sep 24, 2024
982533f
more FinishReason exception throwing
justinthelaw Sep 24, 2024
4c4b0b6
fix class method on FinishReason
justinthelaw Sep 24, 2024
78efedb
change method name
justinthelaw Sep 24, 2024
55546a7
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 25, 2024
c7ca585
Merge branch 'main' into chore-update-registry1-weekly-bundle-0.13.0
justinthelaw Sep 25, 2024
072427a
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 25, 2024
5e545f6
Merge branch 'main' into chore-update-registry1-weekly-bundle-0.13.0
justinthelaw Sep 25, 2024
91da0ce
modify release-please-config
justinthelaw Sep 25, 2024
240e2c1
weekly sunday 12AM pst
justinthelaw Sep 25, 2024
d673244
move install to JIT
justinthelaw Sep 25, 2024
81c598c
remove udsCliVersion
justinthelaw Sep 25, 2024
301e9dd
comment typo
justinthelaw Sep 25, 2024
340414f
add v to registry ref
justinthelaw Sep 25, 2024
8c4e194
Merge branch 'main' into chore-update-registry1-weekly-bundle-0.13.0
justinthelaw Sep 25, 2024
beb643f
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 25, 2024
3defb55
better sub yq cmd
justinthelaw Sep 25, 2024
4fdec61
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Sep 25, 2024
da1e466
add failure logging
justinthelaw Sep 25, 2024
b2b6905
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Sep 25, 2024
3cfecf0
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 26, 2024
94d2385
Merge branch 'main' into chore-update-registry1-weekly-bundle-0.13.0
justinthelaw Sep 26, 2024
ccd99e9
Merge branch 'main' into chore-update-registry1-weekly-bundle-0.13.0
justinthelaw Sep 26, 2024
26932de
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 26, 2024
649406d
Update release-please-config.json
justinthelaw Sep 27, 2024
20a73b7
Update and rename e2e-registry1-weekly.yaml to weekly-registry1-e2e-t…
justinthelaw Sep 27, 2024
a4f4c0f
Update and rename weekly-registry1-e2e-testing.yaml to weekly-registr…
justinthelaw Sep 27, 2024
ab5871d
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 27, 2024
0928698
Merge branch 'main' into chore-update-registry1-weekly-bundle-0.13.0
justinthelaw Sep 27, 2024
757166e
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Sep 27, 2024
5cca687
0.13.1
justinthelaw Sep 27, 2024
db7193e
Merge remote-tracking branch 'origin/main' into 835-upgrade-vllm-for-…
justinthelaw Sep 27, 2024
c878283
Merge branch 'main' into chore-update-registry1-weekly-bundle-0.13.0
justinthelaw Sep 27, 2024
be13c59
filename typo
justinthelaw Sep 27, 2024
1264c4c
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Sep 27, 2024
db7e27a
make target typo
justinthelaw Sep 27, 2024
ef2f559
env variabilized
justinthelaw Sep 27, 2024
bcc1287
make target just does not work
justinthelaw Sep 27, 2024
03837c9
image_versions explicit set
justinthelaw Sep 27, 2024
8e4faf3
image_versions explicit set, pt.2
justinthelaw Sep 27, 2024
7a3c365
Merge branch 'main' into chore-update-registry1-weekly-bundle-0.13.0
justinthelaw Sep 27, 2024
f77bcfe
use version pattern from release.yaml
justinthelaw Sep 27, 2024
37093dd
merge and resolve release conflict
justinthelaw Sep 27, 2024
14351c1
remove the v
justinthelaw Sep 27, 2024
46174ed
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Sep 27, 2024
ce4c30f
Merge branch 'main' into chore-update-registry1-weekly-bundle-0.13.0
justinthelaw Sep 30, 2024
f502e06
fix lint
justinthelaw Sep 30, 2024
af8c971
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Sep 30, 2024
fd2c153
cutover to utils.client.py
justinthelaw Oct 1, 2024
d22439e
Merge branch 'main' into chore-update-registry1-weekly-bundle-0.13.0
justinthelaw Oct 1, 2024
5c493ea
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Oct 1, 2024
ae68868
cutover to utils.client.py, pt.2
justinthelaw Oct 1, 2024
2acb604
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Oct 1, 2024
ca55f72
cutover to utils.client.py, pt.3
justinthelaw Oct 1, 2024
a42c320
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Oct 1, 2024
807fbdc
fix text embeddings backend full
justinthelaw Oct 1, 2024
abff6bd
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Oct 1, 2024
a9c34fb
remove extraneous env
justinthelaw Oct 1, 2024
8caf64f
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Oct 1, 2024
a6f0af0
add get_supabase_url, default model warnings
justinthelaw Oct 1, 2024
0b291f4
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Oct 1, 2024
d590268
supabase base url incorrect
justinthelaw Oct 1, 2024
54af6dc
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Oct 1, 2024
17e20fa
supabase_url in wrong position
justinthelaw Oct 1, 2024
2c3b7f1
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Oct 1, 2024
76efca3
Merge remote-tracking branch 'origin/main' into chore-update-registry…
justinthelaw Oct 1, 2024
1211e69
fastapi status code usage
justinthelaw Oct 1, 2024
7e6bdb2
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Oct 1, 2024
a4c5ace
FinishReason _missing_ class method
justinthelaw Oct 1, 2024
5ee07cf
new missing JWT
justinthelaw Oct 1, 2024
df60811
Merge remote-tracking branch 'origin/chore-update-registry1-weekly-bu…
justinthelaw Oct 1, 2024
0c12449
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Oct 1, 2024
99c27c9
missing ZARF VAR passthrough to values
justinthelaw Oct 2, 2024
c106e10
more clarity in the README
justinthelaw Oct 3, 2024
d92b572
Merge branch 'main' into 835-upgrade-vllm-for-gptq-bfloat16-inferencing
justinthelaw Oct 3, 2024
20 changes: 12 additions & 8 deletions packages/vllm/.env.example
@@ -1,13 +1,17 @@
 export LAI_HF_HUB_ENABLE_HF_TRANSFER="1"
-export LAI_REPO_ID="TheBloke/Synthia-7B-v2.0-GPTQ"
-export LAI_REVISION="gptq-4bit-32g-actorder_True"
-export LAI_QUANTIZATION="gptq"
+export LAI_REPO_ID="justinthelaw/Hermes-2-Pro-Mistral-7B-4bit-32g"
+export LAI_REVISION="main"
 export LAI_TENSOR_PARALLEL_SIZE=1
+export LAI_TRUST_REMOTE_CODE=True
 export LAI_MODEL_SOURCE=".model/"
 export LAI_MAX_CONTEXT_LENGTH=32768
-export LAI_STOP_TOKENS='["</s>","<|endoftext|>","<|im_end|>"]'
-export LAI_PROMPT_FORMAT_CHAT_SYSTEM="SYSTEM: {}\n"
-export LAI_PROMPT_FORMAT_CHAT_ASSISTANT="ASSISTANT: {}\n"
-export LAI_PROMPT_FORMAT_CHAT_USER="USER: {}\n"
+export LAI_STOP_TOKENS='["</s>"]'
+export LAI_PROMPT_FORMAT_CHAT_SYSTEM="<|system|>\n{}<|end|>\n"
+export LAI_PROMPT_FORMAT_CHAT_USER="<|user|>\n{}<|end|>\n"
+export LAI_PROMPT_FORMAT_CHAT_ASSISTANT="<|assistant|>\n{}<|end|>\n"
 export LAI_PROMPT_FORMAT_DEFAULTS_TOP_P=1.0
-export LAI_PROMPT_FORMAT_DEFAULTS_TOP_K=0
+export LAI_PROMPT_FORMAT_DEFAULTS_TOP_K=0
+export LAI_ENFORCE_EAGER=False
+export LAI_GPU_MEMORY_UTILIZATION=0.90
+export LAI_WORKER_USE_RAY=True
+export LAI_ENGINE_USE_RAY=True
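For local development, the same variables can be exported into the shell before launching the backend. A minimal sketch, assuming the example file is copied to `.env` and the repo's virtual environment is already active; the copy step is an assumption, while the run command mirrors the Dockerfile's ENTRYPOINT:

```bash
# export every LAI_* variable from the example file, then start the backend;
# confz picks the values up from the environment at startup
cp packages/vllm/.env.example packages/vllm/.env
set -a && source packages/vllm/.env && set +a
python -m leapfrogai_sdk.cli --app-dir=packages/vllm/src/ main:Model
```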
184 changes: 96 additions & 88 deletions packages/vllm/Dockerfile
@@ -2,125 +2,133 @@ ARG LOCAL_VERSION
FROM ghcr.io/defenseunicorns/leapfrogai/leapfrogai-sdk:${LOCAL_VERSION} AS sdk

FROM nvidia/cuda:12.2.2-devel-ubuntu22.04 AS builder
ARG SDK_DEST=src/leapfrogai_sdk/build

# Set the config file defaults
ARG PYTHON_VERSION=3.11.6
ARG HF_HUB_ENABLE_HF_TRANSFER="1"
ARG REPO_ID="TheBloke/Synthia-7B-v2.0-GPTQ"
ARG REVISION="gptq-4bit-32g-actorder_True"
ARG QUANTIZATION="gptq"
ARG MODEL_SOURCE="/data/.model/"
ARG MAX_CONTEXT_LENGTH=32768
ARG STOP_TOKENS='["</s>","<|endoftext|>","<|im_end|>"]'
ARG PROMPT_FORMAT_CHAT_SYSTEM="SYSTEM: {}\n"
ARG PROMPT_FORMAT_CHAT_ASSISTANT="ASSISTANT: {}\n"
ARG PROMPT_FORMAT_CHAT_USER="USER: {}\n"
ARG PROMPT_FORMAT_DEFAULTS_TOP_P=1.0
ARG PROMPT_FORMAT_DEFAULTS_TOP_K=0
ARG TENSOR_PARALLEL_SIZE=1

ENV DEBIAN_FRONTEND=noninteractive
# set SDK location
# set the pyenv and Python versions
# set model download args
ARG SDK_DEST=src/leapfrogai_sdk/build \
PYTHON_VERSION=3.11.6 \
PYENV_GIT_TAG=v2.4.8

# use root user for deps installation and nonroot user creation
USER root

# get deps for vllm compilation, pyenv, python and model downloading
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
apt-get -y install \
git \
make \
build-essential \
libssl-dev \
zlib1g-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
wget \
curl \
llvm \
libncurses5-dev \
libncursesw5-dev \
tk-dev \
libffi-dev \
liblzma-dev

# setup nonroot user and permissions
RUN groupadd -g 65532 vglusers && \
useradd -ms /bin/bash nonroot -u 65532 -g 65532 && \
usermod -a -G video,sudo nonroot

# grab necessary python dependencies
# TODO @JPERRY: Get context as to why we are doing this for this Dockerfile but not our other ones
RUN apt-get -y update \
&& apt-get install -y software-properties-common \
&& add-apt-repository universe \
&& add-apt-repository ppa:deadsnakes/ppa \
&& apt-get -y update

# get deps for vllm compilation, model download, and pyenv
RUN apt-get -y install git python3-venv make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev tk-dev libffi-dev

USER nonroot
WORKDIR /home/leapfrogai

# copy-in SDK from sdk stage and vllm source code from host
WORKDIR /home/leapfrogai
COPY --from=sdk --chown=nonroot:nonroot /leapfrogai/${SDK_DEST} ./${SDK_DEST}
COPY --chown=nonroot:nonroot packages/vllm packages/vllm

# # create virtual environment for light-weight portability and minimal libraries
RUN git clone --depth=1 https://github.com/pyenv/pyenv.git .pyenv
ENV PYENV_ROOT="/home/leapfrogai/.pyenv"
ENV PATH="$PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH"
RUN pyenv install ${PYTHON_VERSION}
RUN pyenv global ${PYTHON_VERSION}
RUN python3 -m venv .venv
# create virtual environment for light-weight portability and minimal libraries
RUN curl https://pyenv.run | bash && \
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc && \
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc && \
echo 'eval "$(pyenv init -)"' >> ~/.bashrc && \
echo 'eval "$(pyenv virtualenv-init -)"' >> ~/.bashrc

# Set environment variables
ENV PYENV_ROOT="/home/nonroot/.pyenv" \
PATH="/home/nonroot/.pyenv/bin:$PATH"

# Install Python 3.11.6, set it as global, and create a venv
RUN . ~/.bashrc && \
PYTHON_CONFIGURE_OPTS="--enable-shared" pyenv install 3.11.6 && \
pyenv global 3.11.6 && \
pyenv exec python -m venv .venv

# set path to venv python
ENV PATH="/home/leapfrogai/.venv/bin:$PATH"

RUN rm -f packages/vllm/build/*.whl
RUN python -m pip wheel packages/vllm -w packages/vllm/build --find-links=${SDK_DEST}
RUN pip install packages/vllm/build/lfai_vllm*.whl --no-index --find-links=packages/vllm/build/
RUN rm -f packages/vllm/build/*.whl && \
python -m pip wheel packages/vllm -w packages/vllm/build --find-links=${SDK_DEST} && \
pip install packages/vllm/build/lfai_vllm*.whl --no-index --find-links=packages/vllm/build/

#################
# FINAL CONTAINER
#################

FROM nvidia/cuda:12.2.2-runtime-ubuntu22.04
## COPIED FROM ABOVE ##
ARG SDK_DEST=src/leapfrogai_sdk/build
# Set the config file defaults
ARG PYTHON_VERSION=3.11.6
ARG HF_HUB_ENABLE_HF_TRANSFER="1"
ARG REPO_ID="TheBloke/Synthia-7B-v2.0-GPTQ"
ARG REVISION="gptq-4bit-32g-actorder_True"
ARG QUANTIZATION="gptq"
ARG MODEL_SOURCE="/data/.model/"
ARG MAX_CONTEXT_LENGTH=32768
ARG STOP_TOKENS='["</s>","<|endoftext|>","<|im_end|>"]'
ARG PROMPT_FORMAT_CHAT_SYSTEM="SYSTEM: {}\n"
ARG PROMPT_FORMAT_CHAT_ASSISTANT="ASSISTANT: {}\n"
ARG PROMPT_FORMAT_CHAT_USER="USER: {}\n"
ARG PROMPT_FORMAT_DEFAULTS_TOP_P=1.0
ARG PROMPT_FORMAT_DEFAULTS_TOP_K=0
ARG TENSOR_PARALLEL_SIZE=1

ENV DEBIAN_FRONTEND=noninteractive
# set SDK location
ARG SDK_DEST=src/leapfrogai_sdk/build

# model-specific arguments
ARG TRUST_REMOTE_CODE="True" \
MODEL_SOURCE=".model/" \
MAX_CONTEXT_LENGTH=32768 \
STOP_TOKENS='["</s>"]' \
PROMPT_FORMAT_CHAT_SYSTEM="<|system|>\n{}<|end|>\n" \
PROMPT_FORMAT_CHAT_USER="<|user|>\n{}<|end|>\n" \
PROMPT_FORMAT_CHAT_ASSISTANT="<|assistant|>\n{}<|end|>\n" \
PROMPT_FORMAT_DEFAULTS_TOP_P=1.0 \
PROMPT_FORMAT_DEFAULTS_TOP_K=0 \
TENSOR_PARALLEL_SIZE=1 \
ENFORCE_EAGER=False \
GPU_MEMORY_UTILIZATION=0.99 \
WORKER_USE_RAY=True \
ENGINE_USE_RAY=True

# setup nonroot user and permissions
USER root

RUN groupadd -g 65532 vglusers && \
useradd -ms /bin/bash nonroot -u 65532 -g 65532 && \
usermod -a -G video,sudo nonroot

RUN apt-get -y update
RUN apt-get -y install git wget build-essential libssl-dev zlib1g-dev libffi-dev

USER nonroot

WORKDIR /home/leapfrogai

# copy-in SDK from sdk stage, model and vllm source code from builder
COPY --from=sdk --chown=nonroot:nonroot /leapfrogai/${SDK_DEST} ./${SDK_DEST}
COPY --from=builder --chown=nonroot:nonroot /home/leapfrogai/.venv /home/leapfrogai/.venv
COPY --from=builder --chown=nonroot:nonroot /home/leapfrogai/packages/vllm/src /home/leapfrogai/packages/vllm/src
# copy-in python binaries
COPY --from=builder --chown=nonroot:nonroot /home/nonroot/.pyenv/versions/3.11.6/ /home/nonroot/.pyenv/versions/3.11.6/

# load ARG values into env variables for pickup by confz
ENV LAI_TRUST_REMOTE_CODE=${TRUST_REMOTE_CODE} \
LAI_MODEL_SOURCE=${MODEL_SOURCE} \
LAI_MAX_CONTEXT_LENGTH=${MAX_CONTEXT_LENGTH} \
LAI_STOP_TOKENS=${STOP_TOKENS} \
LAI_PROMPT_FORMAT_CHAT_SYSTEM=${PROMPT_FORMAT_CHAT_SYSTEM} \
LAI_PROMPT_FORMAT_CHAT_USER=${PROMPT_FORMAT_CHAT_USER} \
LAI_PROMPT_FORMAT_CHAT_ASSISTANT=${PROMPT_FORMAT_CHAT_ASSISTANT} \
LAI_PROMPT_FORMAT_DEFAULTS_TOP_P=${PROMPT_FORMAT_DEFAULTS_TOP_P} \
LAI_PROMPT_FORMAT_DEFAULTS_TOP_K=${PROMPT_FORMAT_DEFAULTS_TOP_K} \
LAI_TENSOR_PARALLEL_SIZE=${TENSOR_PARALLEL_SIZE} \
LAI_ENFORCE_EAGER=${ENFORCE_EAGER} \
LAI_GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION} \
LAI_WORKER_USE_RAY=${WORKER_USE_RAY} \
LAI_ENGINE_USE_RAY=${ENGINE_USE_RAY} \
# remove vLLM callback to stats server
VLLM_NO_USAGE_STATS=1

# # create virtual environment for light-weight portability and minimal libraries
RUN git clone --depth=1 https://github.com/pyenv/pyenv.git .pyenv
ENV PYENV_ROOT="/home/leapfrogai/.pyenv"
ENV PATH="$PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH"
RUN pyenv install ${PYTHON_VERSION}
ENV PATH="/home/leapfrogai/.venv/bin:$PATH"

# download model
ENV HF_HOME=/home/leapfrogai/.cache/huggingface

# Load ARG values into env variables for pickup by confz
ENV LAI_HF_HUB_ENABLE_HF_TRANSFER=${HF_HUB_ENABLE_HF_TRANSFER}
ENV LAI_REPO_ID=${REPO_ID}
ENV LAI_REVISION=${REVISION}
ENV LAI_QUANTIZATION=${QUANTIZATION}
ENV LAI_TENSOR_PARALLEL_SIZE=${TENSOR_PARALLEL_SIZE}
ENV LAI_MODEL_SOURCE=${MODEL_SOURCE}
ENV LAI_MAX_CONTEXT_LENGTH=${MAX_CONTEXT_LENGTH}
ENV LAI_STOP_TOKENS=${STOP_TOKENS}
ENV LAI_PROMPT_FORMAT_CHAT_SYSTEM=${PROMPT_FORMAT_CHAT_SYSTEM}
ENV LAI_PROMPT_FORMAT_CHAT_ASSISTANT=${PROMPT_FORMAT_CHAT_ASSISTANT}
ENV LAI_PROMPT_FORMAT_CHAT_USER=${PROMPT_FORMAT_CHAT_USER}
ENV LAI_PROMPT_FORMAT_DEFAULTS_TOP_P=${PROMPT_FORMAT_DEFAULTS_TOP_P}
ENV LAI_PROMPT_FORMAT_DEFAULTS_TOP_K=${PROMPT_FORMAT_DEFAULTS_TOP_K}

EXPOSE 50051:50051

ENTRYPOINT ["python", "-m", "leapfrogai_sdk.cli", "--app-dir=packages/vllm/src/", "main:Model"]
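Every ARG in the final stage can be overridden at build time to re-tune the backend without editing the Dockerfile. A sketch, assuming the build runs from the repository root; the image tag and the ARG values shown are placeholders:

```bash
# rebuild the backend with a smaller context window and a lower vRAM ceiling;
# the ARG names come from the Dockerfile above, the values are illustrative
docker build \
  --build-arg LOCAL_VERSION=dev \
  --build-arg MAX_CONTEXT_LENGTH=16384 \
  --build-arg GPU_MEMORY_UTILIZATION=0.90 \
  --build-arg ENFORCE_EAGER=False \
  --build-arg TENSOR_PARALLEL_SIZE=1 \
  -t leapfrogai/vllm:dev \
  -f packages/vllm/Dockerfile .
```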
14 changes: 8 additions & 6 deletions packages/vllm/README.md
@@ -2,7 +2,6 @@

A LeapfrogAI API-compatible [vLLM](https://github.com/vllm-project/vllm) wrapper for quantized and un-quantized model inferencing across GPU infrastructures.


## Usage

See [instructions](#instructions) to get the backend up and running. Then, use the [LeapfrogAI API server](https://github.com/defenseunicorns/leapfrogai-api) to interact with the backend.
@@ -21,15 +20,17 @@ The following are additional assumptions for GPU inferencing:

### Model Selection

The default model that comes with this backend in this repository's officially released images is a [4-bit quantization of the Synthia-7b model](https://huggingface.co/TheBloke/SynthIA-7B-v2.0-GPTQ).
The default model that comes with this backend in this repository's officially released images is a [4-bit quantization of the Phi-3-Mini-128k-Instruct model](https://huggingface.co/bsmit1659/Phi-3-mini-128k-instruct-0.2-awq).

You can optionally specify different models or quantization types using the following Docker build arguments:

- `--build-arg HF_HUB_ENABLE_HF_TRANSFER="1"`: Enable or disable HuggingFace Hub transfer (default: 1)
- `--build-arg REPO_ID="TheBloke/Synthia-7B-v2.0-GPTQ"`: HuggingFace repository ID for the model
- `--build-arg REVISION="gptq-4bit-32g-actorder_True"`: Revision or commit hash for the model
- `--build-arg QUANTIZATION="gptq"`: Quantization type (e.g., gptq, awq, or empty for un-quantized)
- `--build-arg MAX_CONTEXT_LENGTH="32768"`: Max context length; cannot exceed the model's max length. The greater the length, the greater the vRAM requirements
- `--build-arg TENSOR_PARALLEL_SIZE="1"`: The number of gpus to spread the tensor processing across
- `--build-arg TRUST_REMOTE_CODE="True"`: Whether to trust inferencing code downloaded as part of the model download
- `--build-arg ENGINE_USE_RAY="False"`: Distributed, multi-node inferencing mode for the engine
- `--build-arg WORKER_USE_RAY="False"`: Distributed, multi-node inferencing mode for the worker(s)
- `--build-arg GPU_MEMORY_UTILIZATION="0.99"`: Max memory utilization (fraction, out of 1.0) for the vLLM process
- `--build-arg ENFORCE_EAGER="False"`: Disable CUDA graphs for faster first-token inferencing at the cost of more GPU memory (set to False for production)

## Zarf Package Deployment

@@ -46,6 +47,7 @@ uds zarf package deploy packages/vllm/zarf-package-vllm-*-dev.tar.zst --confirm
## Run Locally

To run the vllm backend locally (starting from the root directory of the repository):

```bash
# Setup Virtual Environment if you haven't done so already
python -m venv .venv
2 changes: 1 addition & 1 deletion packages/vllm/pyproject.toml
@@ -8,7 +8,7 @@ version = "0.9.2"

 dependencies = [
     "pydantic >= 2.3.0",
-    "vllm==0.4.2",
+    "vllm==0.5.3.post1",
     "python-dotenv>=1.0.1",
     "aiostream>=0.5.2",
     "leapfrogai-sdk",
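Since the engine pin is the heart of this upgrade, it is worth confirming which wheel actually lands in the environment. A one-line sanity check; generic Python, not repo-specific tooling:

```bash
# verify the upgraded engine version after installing the package
python -c "import vllm; print(vllm.__version__)"  # expected: 0.5.3.post1
```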
60 changes: 52 additions & 8 deletions packages/vllm/src/config.py
@@ -5,39 +5,85 @@


 class ConfigOptions(BaseConfig):
-    quantization: Literal[None, "awq", "gptq", "squeezellm"] = Field(
-        default=None,
-        description="Type of quantization, for un-quantized models omit this field",
-    )
     tensor_parallel_size: int = Field(
         default=1,
         title="GPU Utilization Count",
         description="The number of gpus to spread the tensor processing across."
         "This must be divisible to the number of attention heads in the model",
         examples=[1, 2, 3],
     )
+    enforce_eager: bool = Field(
+        default=True,
+        title="Enable Eager Mode",
+        description="Enable eager mode to start token generation immediately after prompt processing."
+        "Potentially reduces initial latency at the cost of slightly higher memory usage."
+        "Should be set to False in production environments with higher GPU memory.",
+        examples=[True, False],
+    )
+    gpu_memory_utilization: float = Field(
+        default=0.99,
+        title="GPU Memory Limit",
+        description="Maximum amount of GPU vRAM allocated to the vLLM engine and worker(s)",
+        examples=[0.50, 0.90, 0.99],
+    )
+    engine_use_ray: bool = Field(
+        default=True,
+        title="Use Ray for Engine",
+        description="Enable distributed inferencing for multi-node situations.",
+        examples=[True, False],
+    )
+    worker_use_ray: bool = Field(
+        default=True,
+        title="Use Ray for Worker",
+        description="Enable distributed inferencing for multi-node situations.",
+        examples=[True, False],
+    )
+    trust_remote_code: bool = Field(
+        default=True,
+        title="Trust Downloaded Model Code",
+        description="Whether to trust inferencing code downloaded as part of the model download."
+        "Please review the Python code in the .model/ directory before trusting custom model code.",
+        examples=[True, False],
+    )
 
 
 class DownloadOptions(BaseConfig):
     hf_hub_enable_hf_transfer: Literal["0", "1"] = Field(
         description="Option (0 - Disable, 1 - Enable) for faster transfers, tradeoff stability for faster speeds"
     )
     repo_id: str = Field(
-        description="HuggingFace repo id",
+        description="The HuggingFace git repository ID",
         examples=[
             "TheBloke/Synthia-7B-v2.0-GPTQ",
             "migtissera/Synthia-MoE-v3-Mixtral-8x7B",
             "microsoft/phi-2",
         ],
     )
     revision: str = Field(
-        description="The model branch to use",
+        description="The HuggingFace repository git branch to use",
         examples=["main", "gptq-4bit-64g-actorder_True"],
     )
 
 
 class AppConfig(BaseConfig):
     backend_options: ConfigOptions
+    CONFIG_SOURCES = [
+        EnvSource(
+            allow_all=True,
+            prefix="LAI_",
+            remap={
+                "tensor_parallel_size": "backend_options.tensor_parallel_size",
+                "trust_remote_code": "backend_options.trust_remote_code",
+                "enforce_eager": "backend_options.enforce_eager",
+                "gpu_memory_utilization": "backend_options.gpu_memory_utilization",
+                "worker_use_ray": "backend_options.worker_use_ray",
+                "engine_use_ray": "backend_options.engine_use_ray",
+            },
+        )
+    ]
+
+
+class DownloadConfig(BaseConfig):
     download_options: Optional[DownloadOptions]
     CONFIG_SOURCES = [
         EnvSource(
@@ -47,8 +93,6 @@ class AppConfig(BaseConfig):
                 "hf_hub_enable_hf_transfer": "download_options.hf_hub_enable_hf_transfer",
                 "repo_id": "download_options.repo_id",
                 "revision": "download_options.revision",
-                "quantization": "backend_options.quantization",
-                "tensor_parallel_size": "backend_options.tensor_parallel_size",
             },
         )
     ]
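With the split into `AppConfig` and `DownloadConfig`, confz hydrates backend and download options independently from `LAI_`-prefixed environment variables via the `EnvSource` remaps above. A minimal usage sketch; the environment values are illustrative, and it assumes `packages/vllm/src` is on `PYTHONPATH`:

```python
# illustrative only: exercises the AppConfig/DownloadConfig split shown above
import os

from config import AppConfig, DownloadConfig

os.environ["LAI_TENSOR_PARALLEL_SIZE"] = "2"
os.environ["LAI_GPU_MEMORY_UTILIZATION"] = "0.90"
os.environ["LAI_HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ["LAI_REPO_ID"] = "microsoft/phi-2"
os.environ["LAI_REVISION"] = "main"

app = AppConfig()            # confz reads LAI_* via the EnvSource remap
download = DownloadConfig()

print(app.backend_options.tensor_parallel_size)  # -> 2
print(download.download_options.repo_id)         # -> microsoft/phi-2
```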