# Benchmarking

## vLLM offline benchmarking

The vLLM offline benchmarking script is available at https://github.com/tenstorrent/vllm/blob/dev/examples/offline_inference_tt.py.

It is recommended to run the vLLM model implementation via `docker run`, as described in tt-inference-server/vllm-tt-metal-llama3/README.md.

To measure performance for a single batch (with the default prompt length of 128 tokens):

```bash
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
python examples/offline_inference_tt.py --measure_perf --max_seqs_in_batch 32 --perf_prompt_len 128 --max_tokens 128
# for example, changing to input 2048, output 2048
python examples/offline_inference_tt.py --measure_perf --max_seqs_in_batch 32 --perf_prompt_len 2048 --max_tokens 2048
```
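To sweep several input/output length combinations in one pass, the commands above can be driven from a small wrapper. This is a minimal sketch, not part of the repository; the script name and length pairs are illustrative, and it assumes `WH_ARCH_YAML` is already exported as shown above.

```python
# sweep_offline_perf.py -- illustrative sketch, not part of the repo.
# Repeatedly invokes examples/offline_inference_tt.py with different
# prompt/output lengths; assumes WH_ARCH_YAML is already exported.
import subprocess

LENGTH_PAIRS = [(128, 128), (2048, 2048)]  # (perf_prompt_len, max_tokens)

for prompt_len, max_tokens in LENGTH_PAIRS:
    cmd = [
        "python", "examples/offline_inference_tt.py",
        "--measure_perf",
        "--max_seqs_in_batch", "32",
        "--perf_prompt_len", str(prompt_len),
        "--max_tokens", str(max_tokens),
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```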

### Command Line Arguments

- `--prompts_json` (default: `"tt_metal/prompts.json"`):
  - Path to the prompts JSON file used for inference. Prompts should be given as a JSON list (see the sketch after this list). Not used if `--measure_perf` is set.
- `--measure_perf`:
  - Measure model performance using synthetic inputs. If enabled, any provided `--prompts_json` is ignored and dummy prompts are used for benchmarking.
- `--perf_prompt_len` (default: 128):
  - Length of the dummy prompts (in tokens) used for benchmarking. Only used when `--measure_perf` is provided.
- `--max_tokens` (default: 128):
  - Maximum output length (in tokens) generated by the model for each prompt.
- `--greedy_sampling`:
  - Use greedy decoding instead of probabilistic sampling (top-k/top-p). Greedy decoding always selects the highest-probability token, leading to more deterministic output.
- `--max_seqs_in_batch` (default: 32):
  - Maximum batch size for inference, determining the number of prompts processed in parallel.
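As referenced in the `--prompts_json` item above, here is a sketch for generating a prompts file. It assumes the file is simply a JSON list of prompt strings, matching the "list format" described above; the file name and prompt contents are illustrative.

```python
# make_prompts.py -- illustrative sketch; assumes offline_inference_tt.py
# expects a plain JSON list of prompt strings (the "list format" above).
import json

prompts = [
    "Explain the difference between latency and throughput.",
    "Summarize the benefits of batched inference in two sentences.",
]

with open("prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)

# Then: python examples/offline_inference_tt.py --prompts_json prompts.json
```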

## Online Benchmarking

### Single user

```bash
python utils/prompt_client_cli.py \
    --num_prompts 32 \
    --max_concurrent 1 \
    --tokenizer_model meta-llama/Llama-3.1-70B-Instruct \
    --max_prompt_length 128 \
    --input_seq_len 128 \
    --output_seq_len 128 \
    --template chat_template \
    --dataset random
```
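For a quick single-user sanity check outside of `prompt_client_cli.py`, one request can be timed directly against the OpenAI-compatible completions endpoint served by `run_vllm_api_server.py`. This is a rough sketch under assumptions: the URL, port, and absence of auth headers below are placeholders to adjust for your deployment, and the response is assumed to include the standard `usage` field.

```python
# single_request_check.py -- rough sketch; the URL/port and missing auth
# headers are assumptions, adjust them to match your deployment.
import time
import requests

URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
payload = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "prompt": "Explain the difference between latency and throughput.",
    "max_tokens": 128,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600)
elapsed = time.perf_counter() - start
resp.raise_for_status()

completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"end-to-end latency: {elapsed:.2f} s, "
      f"single-user throughput: {completion_tokens / elapsed:.1f} tok/s")
```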

### Using vllm/benchmarking/benchmark_serving.py

Within the Docker container, first start the vLLM API server:

```bash
cd ~/app/src
python run_vllm_api_server.py
```

The `benchmark_serving.patch` file simply stops the benchmarking script from sending the `best_of` argument, which is not supported and causes issues.

To run the benchmarks, open another shell into the Docker container, apply the patch, and start the benchmark:

```bash
cd ~/vllm
git apply ~/app/benchmarking/benchmark_serving.patch
cd ~/app
export PYTHONPATH=$PYTHONPATH:$PWD
python benchmarking/vllm_online_benchmark.py
```

Output files are written for each defined input/output sequence length and are timestamped.

Results are also printed to stdout; for example, with mock data:

```
==================================================
                    Benchmark Result
==================================================
Successful requests:                     32
Benchmark duration (s):                  0.39
Total input tokens:                      4096
Total generated tokens:                  64
Request throughput (req/s):              83.04
Output token throughput (tok/s):         166.07
Total Token throughput (tok/s):          10794.77
--------------------------------------------------
               Time to First Token
--------------------------------------------------
Mean TTFT (ms):                          358.26
Median TTFT (ms):                        358.45
P99 TTFT (ms):                           361.67
--------------------------------------------------
     Time per Output Token (excl. 1st token)
--------------------------------------------------
Mean TPOT (ms):                          14.03
Median TPOT (ms):                        14.13
P99 TPOT (ms):                           14.30
--------------------------------------------------
             Inter-token Latency
--------------------------------------------------
Mean ITL (ms):                           7.86
Median ITL (ms):                         7.83
P99 ITL (ms):                            8.05
==================================================
```
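For reference, the TTFT, TPOT, and ITL figures above relate to each other per request roughly as sketched below. This is a simplified illustration of how such metrics are typically derived from token arrival timestamps, not the exact code used by `benchmark_serving.py`.

```python
# metrics_sketch.py -- simplified illustration of how per-request TTFT,
# TPOT, and ITL are typically derived from token arrival timestamps.
def request_metrics(request_start: float, token_times: list[float]) -> dict:
    """token_times: absolute arrival timestamps of each generated token."""
    ttft = token_times[0] - request_start      # time to first token
    total = token_times[-1] - request_start    # end-to-end latency
    n_out = len(token_times)
    # time per output token, excluding the first token
    tpot = (total - ttft) / (n_out - 1) if n_out > 1 else 0.0
    # inter-token latency: gaps between consecutive tokens
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {"ttft_s": ttft, "tpot_s": tpot, "itl_s": itl, "latency_s": total}
```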

### Using tt-inference-server/benchmarking/prompt_client_online_benchmark.py

```bash
export PYTHONPATH=$PYTHONPATH:$PWD
python benchmarking/prompt_client_online_benchmark.py
```

## Benchmark summary

Generate a markdown table and .csv output file from multiple benchmarking runs:

```bash
# for vllm_online_benchmark.py
python benchmarking/benchmark_summary.py ~/cache_root/vllm_online_benchmark_results/results_2025-01-17_17-19-28 --output-dir ./vllm_results_summary
# or for prompt_client_online_benchmark.py
python benchmarking/benchmark_summary.py ~/cache_root/online_benchmark_results/results_2025-01-15_20-58-57 --output-dir ./results_summary
```
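To summarize many timestamped runs at once, the same command can be driven from a small loop. This is a sketch, not part of the repository; the results root below comes from the example paths above, and writing each summary to a per-run output directory is an illustrative choice.

```python
# summarize_all_runs.py -- illustrative sketch, not part of the repo.
# Runs benchmark_summary.py for every timestamped results directory under
# the results root used in the examples above.
import glob
import os
import subprocess

results_root = os.path.expanduser("~/cache_root/vllm_online_benchmark_results")

for results_dir in sorted(glob.glob(os.path.join(results_root, "results_*"))):
    out_dir = os.path.join("./vllm_results_summary", os.path.basename(results_dir))
    subprocess.run(
        ["python", "benchmarking/benchmark_summary.py", results_dir,
         "--output-dir", out_dir],
        check=True,
    )
```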