Merge pull request #20 from runpod-workers/update-actions

fix: update badge

justinmerrell authored Dec 14, 2023
2 parents eaa0e86 + 06660fb commit 6366dac
Showing 2 changed files with 42 additions and 42 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@

runpod.toml
82 changes: 40 additions & 42 deletions README.md

<h1>vLLM Endpoint | Serverless Worker </h1>

[![CD | Docker-Build-Release](https://github.com/runpod-workers/worker-vllm/actions/workflows/docker-build-release.yml/badge.svg)](https://github.com/runpod-workers/worker-vllm/actions/workflows/docker-build-release.yml)

🚀 | This serverless worker utilizes vLLM behind the scenes and is integrated into RunPod's serverless environment. It supports dynamic auto-scaling using the built-in RunPod autoscaling feature.
</div>

## Setting up the Serverless Worker

### Option 1: Deploy Any Model Using Pre-Built Docker Image
We now offer a pre-built Docker Image for the vLLM Worker that you can configure entirely with Environment Variables when creating the RunPod Serverless Endpoint:

<div align="center">

#### Environment Variables
- **Required**:
- `MODEL_NAME`: Hugging Face Model Repository (e.g., `openchat/openchat_3.5`).

- **Optional**:
- `MODEL_BASE_PATH`: Model storage directory (default: `/runpod-volume`).
- `HF_TOKEN`: Hugging Face token for private and gated models (e.g., Llama, Falcon).

To build an image with the model baked in, you must specify the following Docker build arguments:
`sudo docker build -t username/image:tag --build-arg MODEL_NAME="openchat/openchat_3.5" --build-arg MODEL_BASE_PATH="/models" .`
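
For reference, here is a sketch of the full build-and-push flow. The tag `username/image:tag` is a placeholder; push to whichever registry your RunPod Serverless endpoint is configured to pull from:

```bash
# Build the image with the model baked in (placeholder tag, example model).
sudo docker build -t username/image:tag \
  --build-arg MODEL_NAME="openchat/openchat_3.5" \
  --build-arg MODEL_BASE_PATH="/models" .

# Push it to a registry so the Serverless endpoint can pull it.
sudo docker push username/image:tag
```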

### Compatible Models
- LLaMA & LLaMA-2
- Mistral
- Mixtral (Mistral MoE)
- Yi
- ChatGLM
- Phi
- MPT
- OPT
- Qwen
- Aquila & Aquila2
- Baichuan
- BLOOM
- Falcon
- GPT-2
- GPT BigCode
- GPT-J
- GPT-NeoX
- InternLM

And any other models supported by vLLM 0.2.4.


Ensure that Docker is installed and properly set up before running the build command. Once built, you can deploy this serverless worker to a RunPod Serverless endpoint and rely on it to scale automatically with demand. For further inquiries or assistance, feel free to contact our support team.


## Model Inputs
| Argument | Type | Default | Description |
|-----------------|------|--------------------|-----------------------------------------------------------------------------------------------|
| prompt | str | | Prompt string to generate text based on. |
| sampling_params | dict | {} | Sampling parameters to control the generation, like temperature, top_p, etc. |
| streaming | bool | False | Whether to enable streaming of output. If True, responses are streamed as they are generated. |
| batch_size      | int  | DEFAULT_BATCH_SIZE | The number of responses to generate in one batch. Only applicable when `streaming` is enabled. |
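
For example, a minimal request to a deployed endpoint might look like the following. This is a sketch assuming the standard RunPod Serverless API; `<ENDPOINT_ID>` and `$RUNPOD_API_KEY` are placeholders for your own values:

```bash
# Synchronous request: send a prompt and wait for the generated text.
curl -s -X POST "https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync" \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "What is the capital of France?", "streaming": false}}'
```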

### Sampling Parameters
| Argument | Type | Default | Description |
|-------------------------------|-----------------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| n | int | 1 | Number of output sequences to return for the given prompt. |
| best_of | Optional[int] | None | Number of output sequences generated from the prompt. The top `n` sequences are returned from these `best_of` sequences. Must be ≥ `n`. Treated as beam width in beam search. Default is `n`. |
| presence_penalty | float | 0.0 | Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. |
| frequency_penalty | float | 0.0 | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. |
| repetition_penalty | float | 1.0 | Penalizes new tokens based on their appearance in the prompt and generated text. Values > 1 encourage new tokens, values < 1 encourage repetition. |
| temperature | float | 1.0 | Controls the randomness of sampling. Lower values make it more deterministic, higher values make it more random. Zero means greedy sampling. |
| top_p | float | 1.0 | Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| top_k | int | -1 | Controls the number of top tokens to consider. Set to -1 to consider all tokens. |
| min_p | float | 0.0 | Represents the minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. |
| use_beam_search | bool | False | Whether to use beam search instead of sampling. |
| length_penalty | float | 1.0 | Penalizes sequences based on their length. Used in beam search. |
| early_stopping | Union[bool, str] | False | Controls stopping condition in beam search. Can be `True`, `False`, or `"never"`. |
| stop | Union[None, str, List[str]] | None | List of strings that stop generation when produced. Output will not contain these strings. |
| stop_token_ids | Optional[List[int]] | None | List of token IDs that stop generation when produced. Output contains these tokens unless they are special tokens. |
| ignore_eos | bool | False | Whether to ignore the End-Of-Sequence token and continue generating tokens after its generation. |
| max_tokens | int | 16 | Maximum number of tokens to generate per output sequence. |
| logprobs | Optional[int] | None | Number of log probabilities to return per output token. |
| prompt_logprobs | Optional[int] | None | Number of log probabilities to return per prompt token. |
| skip_special_tokens | bool | True | Whether to skip special tokens in the output. |
| spaces_between_special_tokens | bool | True | Whether to add spaces between special tokens in the output. |
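
As an illustration, the sketch below passes several of these parameters through `sampling_params`, again assuming the standard RunPod Serverless API and placeholder credentials:

```bash
# Write the request body to a file, then submit it as an asynchronous job.
cat > input.json <<'EOF'
{
  "input": {
    "prompt": "Write a haiku about GPUs.",
    "sampling_params": {
      "temperature": 0.7,
      "top_p": 0.9,
      "max_tokens": 128,
      "stop": ["\n\n"]
    }
  }
}
EOF

curl -s -X POST "https://api.runpod.ai/v2/<ENDPOINT_ID>/run" \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d @input.json
```

The `/run` route queues the job and returns an ID you can poll for the result, while `/runsync` waits and returns the output directly.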


## Sample Inputs and Outputs

