
Allow traffic split across models with user-provided ratios #7

Open · wants to merge 3 commits into main

Conversation


@kaushikmitr (Collaborator) commented Mar 6, 2025

This pull request introduces significant updates to the benchmarking process in benchmark_serving.py and modifies the latency_throughput_curve.sh script to support traffic splitting across multiple models. The most important changes include the addition of traffic split functionality, refactoring of the benchmarking function, and updates to the script for handling new arguments.

Benchmarking improvements:

  • Added a new function run_single_request to handle individual requests, selecting a model for each request based on the traffic split (see the sketch after this list).
  • Refactored the benchmark function to support model selection per request and save results separately for each model.
  • Updated the main function to call the refactored benchmark function with traffic split arguments.
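
The diff itself is not shown on this page, so the following is only a minimal sketch of the per-request model selection idea; run_single_request's signature and the send_request helper here are assumptions for illustration, not the PR's actual code.

import random

# Sketch: each request samples its target model from the user-provided ratios,
# and results are kept per model so metrics can be reported separately.
async def run_single_request(prompt, models, traffic_split, results_by_model):
    # e.g. models = ["meta-llama/Llama-2-7b-hf", "tweet-summary-0", "tweet-summary-1"],
    #      traffic_split = [0.7, 0.2, 0.1]
    model = random.choices(models, weights=traffic_split, k=1)[0]
    latency, ttft = await send_request(model, prompt)  # backend call; assumed helper
    results_by_model[model]["latencies"].append(latency)
    if ttft:
        results_by_model[model]["ttfts"].append(ttft)
    return model, latency, ttft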

Script updates:

  • Added a --traffic-split argument to the argparse parser in benchmark_serving.py to specify traffic-split proportions (a parsing sketch follows this list).
  • Modified latency_throughput_curve.sh to pass the --traffic-split argument through in the Python command options.
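
For illustration, a comma-separated ratio flag could be parsed and validated roughly as below; this is a sketch under assumed names (parser setup, error message), not the PR's exact argparse code.

import argparse
import math

parser = argparse.ArgumentParser()
parser.add_argument(
    "--traffic-split",
    type=lambda s: [float(x) for x in s.split(",")],
    default=None,
    help="Comma-separated proportions, one per model, summing to 1 (e.g. 0.7,0.2,0.1).",
)
args = parser.parse_args()

# Matches tested behavior 3 below: proportions that do not add to 1 error out.
if args.traffic_split is not None and not math.isclose(sum(args.traffic_split), 1.0):
    raise ValueError("--traffic-split proportions must sum to 1")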

Removed save_aggregated_results:

This does not seem reliable: even with a single model, the overall aggregated metrics disagree with the individual model metrics by a significant delta (> 10%). The flag itself was not removed, since existing scripts might otherwise fail.

Minor fixes:

  • Corrected a typo in the latency_throughput_curve.sh script from "Benchmaking" to "Benchmarking".

Tested with:

  1. Three models with a traffic split: python3 benchmark_serving.py --save-json-results --host=35.240.228.178 --port=8000 --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer=meta-llama/Llama-2-7b-hf --request-rate=2 --backend=vllm --num-prompts=30 --max-input-length=1024 --max-output-length=1024 --file-prefix=lgp-ig-prod-v1-base --models=meta-llama/Llama-2-7b-hf,tweet-summary-0,tweet-summary-1 --output-bucket=kaushikmitra-llm-ig-benchmark --save-aggregated-result --output-bucket-filepath t --traffic-split=0.7,0.2,0.1
  2. One model: python3 benchmark_serving.py --save-json-results --host=35.240.228.178 --port=8000 --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer=meta-llama/Llama-2-7b-hf --request-rate=2 --backend=vllm --num-prompts=10 --max-input-length=1024 --max-output-length=1024 --file-prefix=lgp-ig-prod-v1-base --models=meta-llama/Llama-2-7b-hf --output-bucket=kaushikmitra-llm-ig-benchmark --save-aggregated-result --output-bucket-filepath t --traffic-split=1
  3. Three models with a traffic split that does not add to 1: errors out.

@kaushikmitr (Collaborator, Author) left a comment:


overall_results["latencies"].append(latency)
if ttft:
overall_results["ttfts"].append(ttft)
overall_results["tpots"].append((latency[2] - ttft) / (latency[1] - 1))

Collaborator:

It is a little confusing to see that this computes (req_latency - ttft) / (output_tokens - 1), since we are indexing different positions of the same tuple. Can we either change it to a dict or add a comment above explaining what latency[2] and latency[1] are? We could also create separate variables for these, which would make it easier to follow.

Collaborator (Author):

Yes, agreed. Done
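
For illustration, the refactor agreed on above might look like the following sketch, assuming the latency tuple is laid out as (prompt_len, output_len, request_latency), which is what the indexes in the quoted snippet suggest:

# Unpack the tuple into named variables instead of indexing into it.
prompt_len, output_len, request_latency = latency

overall_results["latencies"].append(latency)
if ttft:
    overall_results["ttfts"].append(ttft)
    # Time per output token: decode time spread over the tokens after the first.
    tpot = (request_latency - ttft) / (output_len - 1)
    overall_results["tpots"].append(tpot)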

echo "TOTAL prompts: $num_prompts"

# Build the python command options
PYTHON_OPTS="$PYTHON_OPTS --save-json-results --output-bucket=$OUTPUT_BUCKET --host=$IP --port=$PORT --dataset=$PROMPT_DATASET_FILE --tokenizer=$TOKENIZER --request-rate=$request_rate --backend=$BACKEND --num-prompts=$num_prompts --max-input-length=$INPUT_LENGTH --max-output-length=$OUTPUT_LENGTH --file-prefix=$FILE_PREFIX --models=$MODELS --traffic-split=$TRAFFIC_SPLIT"

Collaborator:

Have you verified that it works the same as before when --traffic-split is not passed?

Collaborator (Author):

Yes, just added.
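
One plausible way to keep the pre-existing behavior when --traffic-split is omitted (an assumption for illustration, not necessarily what the PR does) is to fall back to an even split across the requested models:

models = args.models.split(",")
if args.traffic_split is None:
    # No ratios supplied: default to an even split so existing invocations keep working.
    traffic_split = [1.0 / len(models)] * len(models)
else:
    traffic_split = args.traffic_split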


benchmark_duration_all_models = time.time() - benchmark_start_time
if args.save_aggregated_result:

Collaborator:

Were we using this for anything specific before? Are we missing anything by removing it? cc @liu-cong

Collaborator (Author):

I was not using this; I was relying on the individual per-model metrics instead.

@achandrasekar (Collaborator) left a comment:

Looks good. You might need to rebase.
