The vLLM benchmarking script is https://github.com/tenstorrent/vllm/blob/dev/examples/offline_inference_tt.py.
It is recommended to run the vLLM model implementation via `docker run`, following the instructions at tt-inference-server/vllm-tt-metal-llama3/README.md.
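As a rough illustration only (the image name, tag, and mount paths below are placeholders and the exact flags differ per setup; use the command given in that README), a Docker invocation looks roughly like:
# Illustrative sketch only: image name, tag, and mounts are placeholders.
# Follow tt-inference-server/vllm-tt-metal-llama3/README.md for the exact command.
docker run -it --rm \
  --device /dev/tenstorrent \
  --shm-size 32G \
  -v $HOME/cache_root:/home/user/cache_root \
  <vllm-tt-metal-llama3-image>:<tag>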
To measure performance for a single batch (with the default prompt length of 128 tokens):
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
python examples/offline_inference_tt.py --measure_perf --max_seqs_in_batch 32 --perf_prompt_len 128 --max_tokens 128
# for example, changing to input 2048, output 2048
python examples/offline_inference_tt.py --measure_perf --max_seqs_in_batch 32 --perf_prompt_len 2048 --max_tokens 2048
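To sweep several input/output sequence lengths with the same flags, a simple shell loop works (a minimal sketch; the length pairs are examples only):
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
# Sweep a few (input, output) token-length pairs; adjust the list as needed.
for lens in "128 128" "2048 2048" "128 2048"; do
  read -r in_len out_len <<< "$lens"
  python examples/offline_inference_tt.py --measure_perf \
    --max_seqs_in_batch 32 --perf_prompt_len "$in_len" --max_tokens "$out_len"
done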
The script accepts the following command-line arguments:
- `--prompts_json` (default: `"tt_metal/prompts.json"`): Path to the prompts JSON file used for inference. Prompts should be in a list format (see the example after this list). This will not be used if `measure_perf` is set.
- `--measure_perf`: Measure model performance using synthetic inputs. If enabled, any provided `prompts_json` is ignored and dummy prompts are used instead for benchmarking.
- `--perf_prompt_len` (default: `128`): Length of the dummy prompts (in tokens) used for benchmarking. Used only when `--measure_perf` is provided.
- `--max_tokens` (default: `128`): Maximum output length (in tokens) generated by the model for each prompt.
- `--greedy_sampling`: Use greedy decoding instead of probabilistic sampling (top-k/top-p). Greedy sampling always selects the token with the highest probability, leading to more deterministic output.
- `--max_seqs_in_batch` (default: `32`): Maximum batch size for inference, determining the number of prompts processed in parallel.
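Since `--prompts_json` expects the prompts as a plain JSON list, a minimal file can be created as follows (the prompt texts and output path here are only illustrative):
# Minimal sketch of the list-format prompts file expected by --prompts_json.
# Prompt texts and output path are illustrative.
cat > prompts.json << 'EOF'
[
  "What is the capital of France?",
  "Summarize the plot of Hamlet in two sentences."
]
EOF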
To send prompts to a running inference server, you can use the prompt client CLI, for example:
python utils/prompt_client_cli.py \
--num_prompts 32 \
--max_concurrent 1 \
--tokenizer_model meta-llama/Llama-3.1-70B-Instruct \
--max_prompt_length 128 \
--input_seq_len 128 \
--output_seq_len 128 \
--template chat_template \
--dataset random
Within the Docker container, start the vLLM API server:
cd ~/app/src
python run_vllm_api_server.py
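Before running the benchmark, you can optionally check that the server is responding. The sketch below assumes the default vLLM OpenAI-compatible endpoint on port 8000; the model name is taken from the example above, and your deployment may additionally require an Authorization header:
# Optional sanity check (port and endpoint are assumptions; adjust to your deployment).
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct", "prompt": "Hello", "max_tokens": 8}'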
To run the benchmarks, open another shell into the Docker container and apply the `benchmark_serving.patch` file; the patch simply stops the benchmarking script from sending the `best_of` argument, which is not supported and causes issues:
cd ~/vllm
git apply ~/app/benchmarking/benchmark_serving.patch
cd ~/app
export PYTHONPATH=$PYTHONPATH:$PWD
python benchmarking/vllm_online_benchmark.py
The output is timestamped and written out for each input/output sequence length defined. Results are also printed to stdout; for example, with mock data:
==================================================
Benchmark Result
==================================================
Successful requests: 32
Benchmark duration (s): 0.39
Total input tokens: 4096
Total generated tokens: 64
Request throughput (req/s): 83.04
Output token throughput (tok/s): 166.07
Total Token throughput (tok/s): 10794.77
--------------------------------------------------
Time to First Token
--------------------------------------------------
Mean TTFT (ms): 358.26
Median TTFT (ms): 358.45
P99 TTFT (ms): 361.67
--------------------------------------------------
Time per Output Token (excl. 1st token)
--------------------------------------------------
Mean TPOT (ms): 14.03
Median TPOT (ms): 14.13
P99 TPOT (ms): 14.30
--------------------------------------------------
Inter-token Latency
--------------------------------------------------
Mean ITL (ms): 7.86
Median ITL (ms): 7.83
P99 ITL (ms): 8.05
==================================================
Alternatively, run the prompt client version of the online benchmark:
export PYTHONPATH=$PYTHONPATH:$PWD
python benchmarking/prompt_client_online_benchmark.py
Generate a markdown table and .csv output file from multiple benchmarking runs:
# for vllm_online_benchmark.py
python benchmarking/benchmark_summary.py ~/cache_root/vllm_online_benchmark_results/results_2025-01-17_17-19-28 --output-dir ./vllm_results_summary
# or for prompt_client_online_benchmark.py
python benchmarking/benchmark_summary.py ~/cache_root/online_benchmark_results/results_2025-01-15_20-58-57 --output-dir ./results_summary
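To take a quick look at the generated .csv from the shell (a minimal sketch; the exact filename inside the output directory may differ):
# List the summary directory, then view the CSV as an aligned table.
ls ./vllm_results_summary
column -t -s, ./vllm_results_summary/*.csv | less -S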