
Allow traffic split across models with user-provided ratios #7

Open · wants to merge 3 commits into main

Conversation


@kaushikmitr (Collaborator) commented Mar 6, 2025

This pull request introduces significant updates to the benchmarking process in benchmark_serving.py and modifies the latency_throughput_curve.sh script to support traffic splitting across multiple models. The most important changes include the addition of traffic split functionality, refactoring of the benchmarking function, and updates to the script for handling new arguments.

Benchmarking improvements:

  • Added a new function run_single_request to handle individual requests, selecting a model for each request based on the traffic split (see the sketch after this list).
  • Refactored the benchmark function to support model selection per request and save results separately for each model.
  • Updated the main function to call the refactored benchmark function with traffic split arguments.
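
The diff itself is not shown on this page, so the following is only a minimal sketch of the per-request model selection idea; run_single_request's signature and the send_request helper here are assumptions for illustration, not the PR's actual code.

import random

# Sketch: each request samples its target model from the user-provided ratios,
# and results are kept per model so metrics can be reported separately.
async def run_single_request(prompt, models, traffic_split, results_by_model):
    # e.g. models = ["meta-llama/Llama-2-7b-hf", "tweet-summary-0", "tweet-summary-1"],
    #      traffic_split = [0.7, 0.2, 0.1]
    model = random.choices(models, weights=traffic_split, k=1)[0]
    latency, ttft = await send_request(model, prompt)  # backend call; assumed helper
    results_by_model[model]["latencies"].append(latency)
    if ttft:
        results_by_model[model]["ttfts"].append(ttft)
    return model, latency, ttft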

Script updates:

  • Added a --traffic-split argument to the argparse parser in benchmark_serving.py to specify traffic-split proportions (a parsing sketch follows this list).
  • Modified latency_throughput_curve.sh to pass the --traffic-split argument through in the Python command options.
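
For illustration, a comma-separated ratio flag could be parsed and validated roughly as below; this is a sketch under assumed names (parser setup, error message), not the PR's exact argparse code.

import argparse
import math

parser = argparse.ArgumentParser()
parser.add_argument(
    "--traffic-split",
    type=lambda s: [float(x) for x in s.split(",")],
    default=None,
    help="Comma-separated proportions, one per model, summing to 1 (e.g. 0.7,0.2,0.1).",
)
args = parser.parse_args()

# Matches tested behavior 3 below: proportions that do not add to 1 error out.
if args.traffic_split is not None and not math.isclose(sum(args.traffic_split), 1.0):
    raise ValueError("--traffic-split proportions must sum to 1")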

Removed save_aggregated_results:

This does not seem reliable: even with a single model, the overall aggregated metrics disagree with the individual model metrics by a significant delta (> 10%). The flag itself was not removed, since existing scripts might otherwise fail.

Minor fixes:

  • Corrected a typo in the latency_throughput_curve.sh script from "Benchmaking" to "Benchmarking".

Tested with:

  1. Three models with a traffic split: python3 benchmark_serving.py --save-json-results --host=35.240.228.178 --port=8000 --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer=meta-llama/Llama-2-7b-hf --request-rate=2 --backend=vllm --num-prompts=30 --max-input-length=1024 --max-output-length=1024 --file-prefix=lgp-ig-prod-v1-base --models=meta-llama/Llama-2-7b-hf,tweet-summary-0,tweet-summary-1 --output-bucket=kaushikmitra-llm-ig-benchmark --save-aggregated-result --output-bucket-filepath t --traffic-split=0.7,0.2,0.1
  2. One model: python3 benchmark_serving.py --save-json-results --host=35.240.228.178 --port=8000 --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer=meta-llama/Llama-2-7b-hf --request-rate=2 --backend=vllm --num-prompts=10 --max-input-length=1024 --max-output-length=1024 --file-prefix=lgp-ig-prod-v1-base --models=meta-llama/Llama-2-7b-hf --output-bucket=kaushikmitra-llm-ig-benchmark --save-aggregated-result --output-bucket-filepath t --traffic-split=1
  3. Three models with a traffic split that does not add to 1: errors out.

@kaushikmitr (Collaborator, Author) left a comment:


overall_results["latencies"].append(latency)
if ttft:
overall_results["ttfts"].append(ttft)
overall_results["tpots"].append((latency[2] - ttft) / (latency[1] - 1))

Collaborator:

It is a little confusing to see that this computes (req_latency - ttft) / (output_tokens - 1), since we are indexing different positions of the same tuple. Can we either change it to a dict or add a comment above explaining what latency[2] and latency[1] are? We could also create separate variables for these, which would make it easier to follow.

Collaborator (Author):

Yes, agreed. Done
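
For illustration, the refactor agreed on above might look like the following sketch, assuming the latency tuple is laid out as (prompt_len, output_len, request_latency), which is what the indexes in the quoted snippet suggest:

# Unpack the tuple into named variables instead of indexing into it.
prompt_len, output_len, request_latency = latency

overall_results["latencies"].append(latency)
if ttft:
    overall_results["ttfts"].append(ttft)
    # Time per output token: decode time spread over the tokens after the first.
    tpot = (request_latency - ttft) / (output_len - 1)
    overall_results["tpots"].append(tpot)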

echo "TOTAL prompts: $num_prompts"

# Build the python command options
PYTHON_OPTS="$PYTHON_OPTS --save-json-results --output-bucket=$OUTPUT_BUCKET --host=$IP --port=$PORT --dataset=$PROMPT_DATASET_FILE --tokenizer=$TOKENIZER --request-rate=$request_rate --backend=$BACKEND --num-prompts=$num_prompts --max-input-length=$INPUT_LENGTH --max-output-length=$OUTPUT_LENGTH --file-prefix=$FILE_PREFIX --models=$MODELS --traffic-split=$TRAFFIC_SPLIT"

Collaborator:

Have you verified that it works the same as before when --traffic-split is not passed?

Collaborator (Author):

Yes, just added.
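
One plausible way to keep the pre-existing behavior when --traffic-split is omitted (an assumption for illustration, not necessarily what the PR does) is to fall back to an even split across the requested models:

models = args.models.split(",")
if args.traffic_split is None:
    # No ratios supplied: default to an even split so existing invocations keep working.
    traffic_split = [1.0 / len(models)] * len(models)
else:
    traffic_split = args.traffic_split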


benchmark_duration_all_models = time.time() - benchmark_start_time
if args.save_aggregated_result:

Collaborator:

Were we using this for anything specific before? Are we missing anything by removing it? cc @liu-cong

Collaborator (Author):

I was not using this; I was relying on the individual per-model metrics instead.

@achandrasekar (Collaborator) left a comment:

Looks good. You might need to rebase.
