allow traffic split across models with user-provided ratios #7
base: main
Conversation
List of changes in the PR description.
benchmark_serving.py (Outdated)
overall_results["latencies"].append(latency) | ||
if ttft: | ||
overall_results["ttfts"].append(ttft) | ||
overall_results["tpots"].append((latency[2] - ttft) / (latency[1] - 1)) |
It is a little confusing to see that this computes (req_latency - ttft) / (output_tokens - 1), since we are picking different array indexes. Can we either change it to a dict or add a comment above explaining what latency[2] and latency[1] are? We could also create separate variables for these, which would make it easier to follow.
Yes, agreed. Done.
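For reference, a minimal sketch of the refactor suggested above, assuming `latency` is a tuple laid out as `(prompt_len, output_len, request_latency)` (the layout and variable names are inferred from this thread, not quoted from the PR):

```python
# Unpack the latency tuple into named variables for readability.
# Assumed layout: (prompt_len, output_len, request_latency).
_, output_tokens, request_latency = latency

overall_results["latencies"].append(latency)
if ttft:
    overall_results["ttfts"].append(ttft)
    # Time per output token (TPOT): decode time spread over the
    # tokens generated after the first.
    tpot = (request_latency - ttft) / (output_tokens - 1)
    overall_results["tpots"].append(tpot)
```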
latency_throughput_curve.sh (Outdated)
echo "TOTAL prompts: $num_prompts" | ||
|
||
# Build the python command options | ||
PYTHON_OPTS="$PYTHON_OPTS --save-json-results --output-bucket=$OUTPUT_BUCKET --host=$IP --port=$PORT --dataset=$PROMPT_DATASET_FILE --tokenizer=$TOKENIZER --request-rate=$request_rate --backend=$BACKEND --num-prompts=$num_prompts --max-input-length=$INPUT_LENGTH --max-output-length=$OUTPUT_LENGTH --file-prefix=$FILE_PREFIX --models=$MODELS --traffic-split=$TRAFFIC_SPLIT" |
Have you verified that it works the same as before when `--traffic-split` is not passed in?
Yes, just added.
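As a sketch of the behavior being verified here, the Python side can treat an absent split as "all traffic to the first model", preserving the old single-model path; the attribute names below are assumptions, not the PR's actual code:

```python
# Assumed attributes: args.models and args.traffic_split are illustrative.
models = args.models.split(",")

if args.traffic_split is None:
    # No --traffic-split given: behave as before and send all
    # traffic to the first (typically the only) model.
    traffic_split = [1.0] + [0.0] * (len(models) - 1)
else:
    traffic_split = args.traffic_split
```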
```python
benchmark_duration_all_models = time.time() - benchmark_start_time
if args.save_aggregated_result:
```
Were we using this for anything specific before? Are we missing anything by removing it? cc @liu-cong
I was not using this; I was relying on the individual model metrics.
Looks good. You might need to rebase.
This pull request introduces significant updates to the benchmarking process in `benchmark_serving.py` and modifies the `latency_throughput_curve.sh` script to support traffic splitting across multiple models. The most important changes include the addition of traffic-split functionality, refactoring of the benchmarking function, and updates to the script for handling new arguments.

Benchmarking improvements:
- Added `run_single_request` to handle individual requests, with model selection based on the traffic split (see the sketch after this list).
- Refactored the `benchmark` function to support model selection per request and to save results separately for each model.
- Updated the `main` function to call the refactored `benchmark` function with the traffic-split arguments.
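To make the per-request selection concrete, here is a minimal sketch of weighted model selection with per-model result buckets; all names and values below are illustrative, not the PR's actual code:

```python
import random
from collections import Counter

# Illustrative inputs, not the PR's actual values.
models = ["model-a", "model-b"]
traffic_split = [0.7, 0.3]
num_prompts = 1000

def pick_model(models: list[str], traffic_split: list[float]) -> str:
    # Draw one model per request, with probability proportional to its ratio.
    return random.choices(models, weights=traffic_split, k=1)[0]

# Results are kept in separate buckets so metrics can be reported per model.
results = {m: {"latencies": [], "ttfts": [], "tpots": []} for m in models}

# Sanity check: the realized split should be close to the requested ratios.
counts = Counter(pick_model(models, traffic_split) for _ in range(num_prompts))
print(counts)  # roughly 700 / 300 for a 0.7 / 0.3 split over 1000 prompts
```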
Script updates:
- Added a `--traffic-split` argument to the `argparse` parser in `benchmark_serving.py` to specify traffic-split proportions (a declaration sketch follows this list).
- Updated `latency_throughput_curve.sh` to include the `--traffic-split` argument in the Python command options. [1] [2]
Removed save_aggregated_results:
This does not seem to be reliable: even with a single model, the overall metrics do not agree with the individual model metrics, with a significant delta (> 10%). The flag itself was not removed, since existing scripts might fail.
Minor fixes:
- Corrected the spelling in the `latency_throughput_curve.sh` script from "Benchmaking" to "Benchmarking".

Tested with:

```
prod-v1-base --models=meta-llama/Llama-2-7b-hf --output-bucket=kaushikmitra-llm-ig-benchmark --save-aggregated-result --output-bucket-filepath t --traffic-split=1
```