This guide provides step-by-step instructions for setting up and deploying the vLLM Llama3.1-70B and vLLM Mock models using TT-Studio.
- Docker: Make sure Docker is installed on your system. Follow the Docker installation guide (a quick verification sketch follows this list).
- Hugging Face Token: Both models require authentication to Hugging Face repositories. To obtain a token, go to your Hugging Face account and generate a token. Additionally, make sure to accept the terms and conditions for the Llama3.1 models by visiting the Hugging Face Meta-Llama page.
- Model Weights Access: To access specific models like Llama3.1, you may need to register with Meta to obtain download links for the model weights. Visit Llama Downloads for more information.
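Before continuing, it can help to confirm the Docker prerequisite is actually satisfied. This is a minimal sketch using standard Docker CLI commands; nothing here is TT-Studio specific:

```bash
# Print the installed Docker CLI version.
docker --version

# Confirm the Docker daemon is running and reachable by the current user.
docker info > /dev/null && echo "Docker daemon is reachable"
```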
To run the mock vLLM model, you will:
- Clone repositories
- Pull the mock model Docker image
- Set up the Hugging Face (HF) token
- Run the mock vLLM model via the GUI
To deploy the Llama3.1-70B model, you will:
- Clone repositories
- Pull the model Docker image
- Set up the Hugging Face (HF) token in the TT-Studio `.env` file
- Run the model setup script
- Update the vLLM environment variable in the TT-Studio environment file
- Deploy and run inference for the Llama3.1-70B model via the GUI
Start by cloning both the `tt-studio` and `tt-inference-server` repositories.
```bash
# Clone tt-studio
git clone https://github.com/tenstorrent/tt-studio
cd tt-studio

# Make the setup script executable
chmod +x startup.sh

# Clone `tt-inference-server` into a separate directory
cd ..
git clone https://github.com/tenstorrent/tt-inference-server
```
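Before moving on, you can confirm both repositories were cloned side by side in the same parent directory; a minimal sketch using standard shell commands:

```bash
# Both directories should be listed; an error here means a clone step was skipped.
ls -d tt-studio tt-inference-server
```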
- Navigate to the Docker images for `tt-inference-server`.
- Pull the Docker image (a quick check that the image is present locally follows this list):

  ```bash
  docker pull ghcr.io/tenstorrent/tt-inference-server:<image-tag>
  ```

- Authenticate your terminal (optional):

  ```bash
  echo YOUR_PAT | docker login ghcr.io -u YOUR_USERNAME --password-stdin
  ```
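After the pull completes, you can confirm the image is available locally. A minimal sketch using the standard Docker CLI:

```bash
# List locally available tt-inference-server images and their tags.
docker images ghcr.io/tenstorrent/tt-inference-server
```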
Add the Hugging Face token to the `.env` file in the `tt-studio/app/` directory:

```
HF_TOKEN=hf_********
```
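To confirm the token is valid before any large downloads begin, you can query Hugging Face's `whoami-v2` endpoint. This is an optional sketch and assumes the token is exported in your shell as `HF_TOKEN`:

```bash
# Returns a JSON description of the account the token belongs to;
# an "Invalid credentials" error means the token is wrong or expired.
curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2
```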
Follow these step-by-step instructions for a smooth, automated model weights setup.
- Navigate to the `vllm-tt-metal-llama3-70b/` folder within `tt-inference-server`. This folder contains the necessary files and scripts for model setup.
- Run the automated setup script as outlined in the official documentation (a hedged invocation sketch follows the note below). This script handles key steps such as configuring environment variables, downloading weight files, repacking weights, and creating directories.
Note: During the setup process, you will see the following prompt:

```
Enter your PERSISTENT_VOLUME_ROOT [default: tt-inference-server/tt_inference_server_persistent_volume]:
```

Do not accept the default path. Instead, set the persistent volume path to `tt-studio/tt_studio_persistent_volume`. This ensures the configuration matches TT-Studio's directory structure; using the default path may result in an incorrect configuration.
By following these instructions, you will have a properly configured model infrastructure, ready for inference and further development.
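For reference only, here is a hedged sketch of what the invocation typically looks like. The script name `setup.sh` is the one referenced later in this guide, and the exact arguments may differ between tt-inference-server versions, so follow the official documentation:

```bash
# Run the automated weights setup from inside the model folder.
# When prompted for PERSISTENT_VOLUME_ROOT, point it at tt-studio's persistent volume
# (see the note above) instead of accepting the default.
cd tt-inference-server/vllm-tt-metal-llama3-70b
chmod +x setup.sh
./setup.sh
```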
Verify that the weights are correctly stored in the following structure:
```
/path/to/tt-studio/tt_studio_persistent_volume/
└── volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/
    ├── layers_0-4.pth
    ├── layers_5-9.pth
    ├── params.json
    └── tokenizer.model
```
What to Look For:
- Ensure all expected weight files (e.g., `layers_0-4.pth`, `params.json`, `tokenizer.model`) are present.
- If any files are missing, re-run the `setup.sh` script to complete the download.
This folder structure allows TT-Studio to automatically recognize and access models without further configuration adjustments. For each model, verify that the weights are correctly copied to this directory to ensure proper access by TT-Studio.
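A quick way to confirm the expected files landed where TT-Studio will look for them; a minimal sketch that assumes the persistent volume path shown above (adjust the path and volume directory name to match your setup):

```bash
# Flag any expected weight file that is missing from the persistent volume.
WEIGHTS_DIR=/path/to/tt-studio/tt_studio_persistent_volume/volume_id_tt-metal-llama-3.1-70b-instructv0.0.1
for f in params.json tokenizer.model layers_0-4.pth layers_5-9.pth; do
  [ -f "$WEIGHTS_DIR/$f" ] && echo "found: $f" || echo "MISSING: $f (re-run setup.sh)"
done
```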
During the model weights download process, a `.env` file will be automatically created. The path to the `.env` file might resemble the following example:

```
/path/to/tt-inference-server/vllm-tt-metal-llama3-70b/.env
```
To ensure the model can be deployed via the TT-Studio GUI, this `.env` file must be copied to the model's persistent storage location. For example:

```
/path/to/tt_studio_persistent_volume/volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/copied_env
```
The following command can be used as a reference (replace paths as necessary):
```bash
sudo cp /$USR/tt-inference-server/vllm-tt-metal-llama3-70b/.env /$USR/tt_studio/tt_studio_persistent_volume/volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/.env
```
The `VLLM_LLAMA31_ENV_FILE` variable within the TT-Studio `$USR/tt-studio/app/.env` file must point to this copied `.env` file. This should be a relative path; for example, it can be set as follows:

```
VLLM_LLAMA31_ENV_FILE="/tt_studio_persistent_volume/volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/.env"
```
After copying the `.env` file, update the `VLLM_LLAMA31_ENV_FILE` variable in the `tt-studio/app/.env` file to point to the copied file path. This ensures TT-Studio uses the correct environment configuration for the model (a sketch of one way to do this follows the example below).

```
VLLM_LLAMA31_ENV_FILE="/path/to/tt_studio_persistent_volume/volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/copied_env"
```
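If you prefer to make this change from the command line rather than editing the file by hand, here is a minimal sketch using `sed` (GNU `sed` syntax; on macOS use `sed -i ''`). The target path is a placeholder and must match the actual location of the copied file:

```bash
# Point VLLM_LLAMA31_ENV_FILE at the copied .env file, in place.
sed -i 's|^VLLM_LLAMA31_ENV_FILE=.*|VLLM_LLAMA31_ENV_FILE="/path/to/tt_studio_persistent_volume/volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/copied_env"|' tt-studio/app/.env

# Confirm the change took effect.
grep '^VLLM_LLAMA31_ENV_FILE=' tt-studio/app/.env
```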
Here is an example of a complete `.env` file configuration for reference:

```
TT_STUDIO_ROOT=/Users/<username>/tt-studio
HOST_PERSISTENT_STORAGE_VOLUME=${TT_STUDIO_ROOT}/tt_studio_persistent_volume
INTERNAL_PERSISTENT_STORAGE_VOLUME=/tt_studio_persistent_volume
BACKEND_API_HOSTNAME="tt-studio-backend-api"
VLLM_LLAMA31_ENV_FILE="/path/to/tt_studio_persistent_volume/volume_id_tt-metal-llama-3.1-70b-instructv0.0.1/copied_env"
# SECURITY WARNING: keep these secret in production!
JWT_SECRET=test-secret-456
DJANGO_SECRET_KEY=django-insecure-default
HF_TOKEN=hf_****
```
- Start TT-Studio: Run TT-Studio using the startup command (a sketch follows this list).
- Access Model Weights: In the TT-Studio interface, navigate to the model weights section.
- Select Custom Weights: Use the custom weights option to select the weights for Llama3.1-70B.
- Run the Model: Start the model and wait for it to initialize.
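As a reference, here is a minimal sketch of the startup command, assuming the `startup.sh` script you made executable earlier; check the tt-studio README for any flags your setup needs:

```bash
# Launch TT-Studio from the repository root.
cd tt-studio
./startup.sh
```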
To view real-time logs from the container, use the following command:
```bash
docker logs -f <container_id>
```
During container initialization, you may encounter log entries like the following, which indicate that the vLLM server has started successfully:
```
INFO 12-11 08:10:36 tt_executor.py:67] # TT blocks: 2068, # CPU blocks: 0
INFO 12-11 08:10:36 tt_worker.py:66] Allocating kv caches
INFO 12-11 08:10:36 api_server.py:232] vLLM to use /tmp/tmp3ki28i0p as PROMETHEUS_MULTIPROC_DIR
INFO 12-11 08:10:36 launcher.py:19] Available routes are:
INFO 12-11 08:10:36 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 12-11 08:10:36 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 12-11 08:10:36 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 12-11 08:10:36 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 12-11 08:10:36 launcher.py:27] Route: /health, Methods: GET
INFO 12-11 08:10:36 launcher.py:27] Route: /tokenize, Methods: POST
INFO 12-11 08:10:36 launcher.py:27] Route: /detokenize, Methods: POST
INFO 12-11 08:10:36 launcher.py:27] Route: /v1/models, Methods: GET
INFO 12-11 08:10:36 launcher.py:27] Route: /version, Methods: GET
INFO 12-11 08:10:36 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 12-11 08:10:36 launcher.py:27] Route: /v1/completions, Methods: POST
INFO: Application startup complete.
```
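Once the `/health` route listed above is available, you can poll it to confirm the server is ready before sending requests. A minimal sketch, assuming the server is exposed on port 7000 as in the inference example below:

```bash
# Prints 200 once the vLLM server is ready to accept requests.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7000/health
```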
To access the container's shell for debugging or manual inspection, use the following command:
```bash
docker exec -it <container_id> bash
```
Use `env` to check environment variables, or run commands directly to inspect the environment. To verify that the server starts properly, you can attempt to launch it manually by running:

```bash
python ***_vllm_api_server.py
```
This will allow you to check for any startup errors or issues directly from the container's shell.
To send a test inference request to the model's OpenAI-compatible chat endpoint (replace `$TOKEN` with your authorization token):

```bash
curl -s --no-buffer -X POST "http://localhost:7000/v1/chat/completions" -H "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" -d '{"model":"meta-llama/Llama-3.1-70B-Instruct","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Hi"}]}'
```
If successful, you will receive a response from the model.
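If `jq` is installed, you can pipe the response through it to extract just the assistant's reply; a minimal sketch that assumes the standard OpenAI-style response shape returned by vLLM:

```bash
# Extract only the generated message text from the JSON response.
curl -s -X POST "http://localhost:7000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"model":"meta-llama/Llama-3.1-70B-Instruct","messages":[{"role":"user","content":"Hi"}]}' \
  | jq -r '.choices[0].message.content'
```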
With the setup complete, you're ready to run inference on the vLLM models (or any other supported models) within TT-Studio. Refer to the documentation and setup instructions in the respective repositories for further guidance.