forked from vllm-project/vllm
Showing 17 changed files with 753 additions and 116 deletions.
The repository README is rewritten: the upstream vLLM README (logo, tagline, latest news and event announcements, feature overview, getting-started links, contributing, sponsors, citation, and contact sections) is replaced with the fork-specific README below.

# vLLM

This repo is a fork of the [vLLM](https://github.com/vllm-project/vllm) repo.

## Usage
Pull the `latest` image from ECR:

```bash
bash docker/pull.sh vllm:latest
```

Run the container (with Llama 3 8B in this case):

```bash
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm \
    --model meta-llama/Meta-Llama-3-8B-Instruct
```
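
Once the container is running, it exposes vLLM's OpenAI-compatible API on port 8000. A minimal sketch of a request with Python's `requests` (the prompt is illustrative; the model name must match the one passed to `--model`):

```python
import requests

# Send a chat completion request to the OpenAI-compatible server in the container.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```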

## Development

### Setup dev mode

Clone the repo and set up the base Docker image:

```bash
docker run --gpus all -it --rm --ipc=host \
    -v $(pwd):/workspace/vllm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    nvcr.io/nvidia/pytorch:23.10-py3
```

Once inside the container, install vLLM in editable (dev) mode along with the dev requirements:

```bash
cd vllm
pip install -e .
pip install -r requirements-dev.txt
pip install boto3
```
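
To confirm that the editable install is the one being used, a quick optional check from inside the container:

```python
# Run inside the dev container after the editable install finishes.
import vllm

print(vllm.__version__)  # should report the version of your local checkout
print(vllm.__file__)     # should point into /workspace/vllm
```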

The install takes a while. Once it finishes, open another terminal **on the host** and run:

```bash
docker commit <container_id> vllm_dev
```

This creates a new image, `vllm_dev`, with the vLLM code installed, so you won't need to reinstall the dev dependencies each time you start a new container.

From now on, you can exit the initial container and use this command to enter the dev container:

```bash
docker run --gpus all -it --rm --ipc=host \
    -v $(pwd):/workspace/vllm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm_dev
```

### Launch the server

Enter the `vllm_dev` container and run:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct
```
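
To check that the dev server is up, you can query the OpenAI-compatible model listing endpoint; a quick sketch with `requests` (the port assumes the `-p 8000:8000` mapping above):

```python
import requests

# List the models served by the OpenAI-compatible API server.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])
```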

### Format the code

Enter the `vllm_dev` container and run:

```bash
bash format.sh
```

### Build the image

Once your changes are ready, you can build the production image. Run these commands **on the host**:

```bash
bash docker/build.sh
```

Then deploy it to ECR:

```bash
bash docker/deploy.sh <version>
```
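
To confirm the push succeeded, you can look the tag up in ECR; a small sketch with `boto3` (the repository name `vllm` and region `us-west-2` match the deploy script; the tag is a placeholder for the version you just deployed):

```python
import boto3

# Check that the freshly pushed tag exists in the ECR repository.
ecr = boto3.client("ecr", region_name="us-west-2")
resp = ecr.describe_images(
    repositoryName="vllm",
    imageIds=[{"imageTag": "<version>"}],  # e.g. the version passed to deploy.sh
)
for image in resp["imageDetails"]:
    print(image["imageTags"], image["imagePushedAt"])
```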

### Upgrade version

You can upgrade the vLLM version by rebasing onto the official repo:

```bash
git clone https://github.com/lightonai/vllm
git remote add official https://github.com/vllm-project/vllm
git fetch official
git rebase <commit_sha>   # Rebase onto a specific commit of the official repo (e.g. the commit SHA of the latest stable release)
git rebase --continue     # After resolving conflicts (if any), continue the rebase
git push origin main --force
```

## Deployment

To deploy a model on SageMaker, follow this [README](https://github.com/lightonai/vllm/blob/main/sagemaker/README.md).

The remaining changed files add the Docker helper scripts and the SageMaker deployment tooling referenced above.

**docker/build.sh**

```bash
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm --build-arg VLLM_MAX_SIZE_MB=400
```

**docker/deploy.sh**

```bash
#!/bin/bash

# Check if a version number was provided as an argument
if [ -z "$1" ]; then
    echo "Usage: $0 <version-number>"
    exit 1
fi

VERSION_NUMBER=$1

ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

if [ $? -ne 0 ]; then
    exit 255
fi

REGION=us-west-2

REPOSITORY_NAME="vllm"
CONTAINER_URI="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPOSITORY_NAME}"

# Log in to ECR
aws ecr get-login-password --region "${REGION}" | docker login --username AWS --password-stdin "${ACCOUNT_ID}".dkr.ecr."${REGION}".amazonaws.com

# Check if the repository exists and create it if not.
aws ecr describe-repositories --repository-names "${REPOSITORY_NAME}" --region "${REGION}" > /dev/null 2>&1
if [ $? -ne 0 ]; then
    aws ecr create-repository --repository-name "${REPOSITORY_NAME}" --region "${REGION}" > /dev/null
fi

# Build the Docker image, then tag and push it with the requested version.
if bash docker/build.sh; then
    docker tag "${REPOSITORY_NAME}" "${CONTAINER_URI}:${VERSION_NUMBER}"
    docker push "${CONTAINER_URI}:${VERSION_NUMBER}"

    # Ask the user if the image should also be tagged as latest.
    read -p "Tag the image as latest? (y/n): " TAG_LATEST

    if [ "$TAG_LATEST" == "y" ]; then
        docker tag "${REPOSITORY_NAME}" "${CONTAINER_URI}:latest"
        docker push "${CONTAINER_URI}:latest"
    fi
else
    echo "Docker build failed."
    exit 2
fi
```

**docker/pull.sh**

```bash
#!/bin/bash

# Check if an image was provided as an argument
if [ -z "$1" ]; then
    echo "Usage: $0 <image>"
    exit 1
fi

IMAGE=$1

ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

if [ $? -ne 0 ]; then
    exit 255
fi

REGION=us-west-2

CONTAINER_URI="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${IMAGE}"

# Log in to ECR, pull the image, and retag it locally without the registry prefix.
aws ecr get-login-password --region "${REGION}" | docker login --username AWS --password-stdin "${ACCOUNT_ID}".dkr.ecr."${REGION}".amazonaws.com

docker pull "${CONTAINER_URI}"

docker tag "${CONTAINER_URI}" "${IMAGE}"
```
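
If you are not sure which tags are available before pulling, you can list them directly from ECR; a small sketch with `boto3`, using the same `vllm` repository and `us-west-2` region as the scripts:

```python
import boto3

# List the image tags currently present in the ECR repository.
ecr = boto3.client("ecr", region_name="us-west-2")
paginator = ecr.get_paginator("list_images")
for page in paginator.paginate(repositoryName="vllm"):
    for image_id in page["imageIds"]:
        print(image_id.get("imageTag", "<untagged>"))
```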

**requirements-deploy.txt**

```
boto3
fire
botocore
requests
```

**sagemaker/README.md**

# Deploy a model to SageMaker

## Install dependencies

```bash
pip install -r requirements-deploy.txt
```

## Deploy a model

Deploy a model to SageMaker:

```bash
python sagemaker/deploy.py --config_path sagemaker/configs/llama3-8b.json
```

> You can create more model configs in the `sagemaker/configs` folder.

To clean up the SageMaker resources:

```bash
python sagemaker/cleanup.py --endpoint_name <endpoint_name>
```
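
Once the endpoint is in service, you can invoke it with `boto3`; a minimal sketch, in which the endpoint name is a placeholder and the request body is only illustrative (the exact schema depends on how the serving container maps SageMaker's `/invocations` route):

```python
import json

import boto3

# Invoke a deployed SageMaker endpoint. Adapt the payload to what the container expects.
smr = boto3.client("sagemaker-runtime", region_name="us-west-2")
response = smr.invoke_endpoint(
    EndpointName="<endpoint_name>",
    ContentType="application/json",
    Body=json.dumps({
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    }),
)
print(response["Body"].read().decode("utf-8"))
```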

**sagemaker/cleanup.py**

```python
import boto3
from botocore.config import Config
import fire


def cleanup(endpoint_name: str, region: str = "us-west-2"):
    """
    Clean up a model from SageMaker.

    Args:
        endpoint_name: The name of the endpoint to clean up.
        region: The AWS region to clean up from.
    """
    config = Config(region_name=region)
    sm_client = boto3.client(service_name="sagemaker", config=config)

    # Resolve the endpoint config and model behind the endpoint before deleting all three.
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    endpoint_config_name = resp["EndpointConfigName"]
    model_name = sm_client.describe_endpoint_config(
        EndpointConfigName=endpoint_config_name
    )["ProductionVariants"][0]["ModelName"]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)


if __name__ == "__main__":
    fire.Fire(cleanup)
```
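
After the script runs, you may want to confirm that the endpoint is really gone; an optional check using the SageMaker `endpoint_deleted` waiter (this is not part of the script above):

```python
import boto3

# Block until SageMaker reports the endpoint as deleted, then list anything left over.
sm_client = boto3.client("sagemaker", region_name="us-west-2")
sm_client.get_waiter("endpoint_deleted").wait(EndpointName="<endpoint_name>")

remaining = sm_client.list_endpoints(NameContains="<endpoint_name>")["Endpoints"]
print("Remaining endpoints:", [e["EndpointName"] for e in remaining])
```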

One of the model configs under `sagemaker/configs/`, targeting Llama 3.1 70B Instruct:

```json
{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "image": "vllm:0.6.0-1",
    "sagemaker_instance_type": "ml.p4d.24xlarge",
    "env_vars": {
        "TENSOR_PARALLEL_SIZE": "8",
        "DISABLE_CUSTOM_ALL_REDUCE": "true",
        "MAX_MODEL_LEN": "32768"
    }
}
```
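
The deploy script itself (`sagemaker/deploy.py`) is not rendered in this view, so purely as an illustration, here is a minimal sketch of how a config like the one above could be turned into a SageMaker endpoint with `boto3`. The role ARN, resource names, ECR image location, and the environment-variable mapping are assumptions, not the repo's actual implementation:

```python
import json

import boto3

# Illustrative sketch only: create a SageMaker model, endpoint config, and endpoint
# from a config file like the one above. Names, role ARN, and env-var mapping are guesses.
with open("sagemaker/configs/llama3-8b.json") as f:  # any config under sagemaker/configs/
    config = json.load(f)

region = "us-west-2"
account_id = boto3.client("sts").get_caller_identity()["Account"]
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{config['image']}"

sm = boto3.client("sagemaker", region_name=region)
sm.create_model(
    ModelName="vllm-model",
    ExecutionRoleArn="<sagemaker_execution_role_arn>",
    PrimaryContainer={
        "Image": image_uri,
        "Environment": {**config["env_vars"], "MODEL_ID": config["model"]},  # env var name is a guess
    },
)
sm.create_endpoint_config(
    EndpointConfigName="vllm-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "vllm-model",
            "InstanceType": config["sagemaker_instance_type"],
            "InitialInstanceCount": 1,
        }
    ],
)
sm.create_endpoint(EndpointName="vllm-endpoint", EndpointConfigName="vllm-endpoint-config")
```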
(The remaining changed files are not rendered in this view.)