docs: simplify getting started (#43)
tgaddair authored Nov 17, 2023
1 parent 68a3ba4 commit e6c5c5d
Showing 2 changed files with 83 additions and 9 deletions.
README.md: 89 changes (81 additions & 8 deletions)
@@ -27,6 +27,7 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve over a hundred f
- [Adapters](#adapters)
- [🏃‍♂️ Getting started](#️-getting-started)
- [Docker](#docker)
- [Kubernetes (Helm)](#kubernetes-helm)
- [📓 API documentation](#-api-documentation)
- [🛠️ Local Development](#️-local-development)
- [CUDA Kernels](#cuda-kernels)
@@ -59,6 +60,8 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve over a hundred f

Other architectures are supported on a best-effort basis, but do not support dynamic adapter loading.

Check the [HuggingFace Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads) to find supported base models.

### Adapters

LoRAX currently supports LoRA adapters, which can be trained using frameworks like [PEFT](https://github.com/huggingface/peft) and [Ludwig](https://ludwig.ai/).
@@ -70,11 +73,19 @@ The following modules can be targeted:
- `v_proj`
- `o_proj`
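
For illustration, here is a minimal training sketch using [PEFT](https://github.com/huggingface/peft) that targets these modules (our assumption of a typical setup, not a prescribed recipe; the model choice, ranks, dropout, and training loop are placeholders):

```python
# Illustrative PEFT sketch: wrap a base model with a LoRA config that targets
# the attention projections listed above. Hyperparameters are placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

lora_config = LoraConfig(
    r=8,                  # rank of the LoRA update matrices
    lora_alpha=16,        # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable

# ...train with your framework of choice, then save the adapter so LoRAX can load it:
# model.save_pretrained("./my-adapter")
```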

You can provide an adapter from the HuggingFace Hub, a local file path, or S3.

Just make sure that the adapter was trained on the same base model used in the deployment. LoRAX only supports one base model at a time, but any number of adapters derived from it!
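
For example (a rough sketch, assuming a LoRAX server is already running as described in the Getting started section below; the local and S3 paths are hypothetical placeholders, not files in this repo):

```python
# Illustrative only: the Hub ID is the adapter used later in this README;
# the local and S3 locations are hypothetical placeholders.
from lorax import Client

client = Client("http://127.0.0.1:8080")
prompt = "What is 6 times 7?"

# Adapter from the HuggingFace Hub:
print(client.generate(prompt, adapter_id="vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k").generated_text)

# Adapter from a local path or S3 (it must be derived from the same base model the server was started with):
# client.generate(prompt, adapter_id="/data/adapters/my-gsm8k-adapter")
# client.generate(prompt, adapter_id="s3://my-bucket/adapters/my-gsm8k-adapter")
```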

## 🏃‍♂️ Getting started

### Docker

The easiest way of getting started is using the official Docker container:
We recommend starting with our pre-built Docker image to avoid compiling custom CUDA kernels and other dependencies.

#### 1. Start Docker container with base LLM

In this example, we'll use Mistral-7B-Instruct as the base model, but you can use any Mistral or Llama model from HuggingFace.

```shell
model=mistralai/Mistral-7B-Instruct-v0.1
@@ -84,28 +95,33 @@ docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibas
```
**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.

To see all options to serve your models (in the [code](https://github.com/predibase/lorax/blob/main/launcher/src/main.rs) or in the cli:
To see all options to serve your models:

```
lorax-launcher --help
```

You can then query the model using either the `/generate` or `/generate_stream` routes:
#### 2. Prompt the base model

LoRAX supports the same `/generate` and `/generate_stream` REST API from [text-generation-inference](https://github.com/huggingface/text-generation-inference) for prompting the base model.

REST:

```shell
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", "parameters": {"adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"}}' \
-d '{"inputs": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"}' \
-H 'Content-Type: application/json'
```

```shell
curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", "parameters": {"adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"}}' \
-d '{"inputs": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"}' \
-H 'Content-Type: application/json'
```

or from Python:
Python:

```shell
pip install lorax-client
```

@@ -117,15 +133,72 @@

```python
from lorax import Client
client = Client("http://127.0.0.1:8080")
prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"

print(client.generate(prompt, adapter_id="vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k").generated_text)
print(client.generate(prompt).generated_text)

text = ""
for response in client.generate_stream(prompt, adapter_id="vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"):
for response in client.generate_stream(prompt):
    if not response.token.special:
        text += response.token.text
print(text)
```

#### 3. Prompt with a LoRA Adapter

You probably noticed that the response from the base model wasn't what you would expect. So let's now prompt our LLM again with a LoRA adapter
trained to answer this type of question.

REST:

```shell
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", "parameters": {"adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"}}' \
-H 'Content-Type: application/json'
```

```shell
curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", "parameters": {"adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"}}' \
-H 'Content-Type: application/json'
```

Python:

```python
adapter_id = "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"

print(client.generate(prompt, adapter_id=adapter_id).generated_text)

text = ""
for response in client.generate_stream(prompt, adapter_id=adapter_id):
    if not response.token.special:
        text += response.token.text
print(text)
```

### Kubernetes (Helm)

LoRAX includes Helm charts that make it easy to start using LoRAX in production with high availability and load balancing on Kubernetes.

To spin up a LoRAX deployment with Helm, you only need to be connected to a Kubernetes cluster through `kubectl`. We provide a default `values.yaml` file that can be used to deploy a Mistral 7B base model to your Kubernetes cluster:

```shell
helm install mistral-7b-release charts/lorax
```

The default [values.yaml](charts/lorax/values.yaml) configuration deploys a single replica of the Mistral 7B model. You can tailor configuration parameters to deploy any Llama or Mistral model by creating a new values file from the template and updating variables. Once a new values file is created, you can run the following command to deploy your LLM with LoRAX:

```shell
helm install -f your-values-file.yaml your-model-release charts/lorax
```

To delete the resources:

```shell
helm uninstall your-model-release
```

### 📓 API documentation

You can consult the OpenAPI documentation of the `lorax` REST API using the `/docs` route.
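
For example, a quick way to confirm the docs are being served (a minimal sketch assuming the host and port from the Docker example above):

```python
# Check that the OpenAPI docs are reachable; adjust the host/port if you
# mapped a different one when starting the container.
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:8080/docs") as resp:
    print(resp.status)  # 200 means the interactive docs are available
```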
server/Makefile: 3 changes (2 additions & 1 deletion)
@@ -24,7 +24,8 @@ install: gen-server install-torch
	pip install -e ".[bnb, accelerate]"

run-dev:
	SAFETENSORS_FAST_GPU=1 python -m torch.distributed.run --nproc_per_node=1 lorax_server/cli.py serve meta-llama/Llama-2-7b-hf --sharded
	# SAFETENSORS_FAST_GPU=1 python -m torch.distributed.run --nproc_per_node=1 lorax_server/cli.py serve meta-llama/Llama-2-7b-hf --sharded
	SAFETENSORS_FAST_GPU=1 python -m torch.distributed.run --nproc_per_node=1 lorax_server/cli.py serve mistralai/Mistral-7B-Instruct-v0.1 --sharded
	# SAFETENSORS_FAST_GPU=1 python -m torch.distributed.run --nproc_per_node=1 lorax_server/cli.py serve alexsherstinsky/Mistral-7B-v0.1-sharded --sharded

export-requirements:
