Streaming API returns malformed response chunks with unwrapped ':\n\n' during high concurrency testing #264
CC @nv-hwoo looks like we will need those SSE improvements to genai-perf: triton-inference-server/perf_analyzer#176
This was supposed to be fixed in genai-perf. This is a valid SSE response used to keep a stream alive when it hasn't produced output within a timeout.
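For context, here is a minimal sketch of spec-compliant SSE handling (an illustration only, not the genai-perf implementation): per the SSE specification, any line starting with ":" is a comment, so a bare ":\n\n" chunk is a valid keep-alive ping that a parser should skip rather than decode as JSON.

```python
import json

def parse_sse_events(raw_stream: str):
    """Yield decoded JSON payloads from an OpenAI-style SSE stream."""
    # Events are separated by a blank line ("\n\n").
    for event in raw_stream.split("\n\n"):
        for line in event.splitlines():
            if line.startswith(":"):
                continue  # SSE comment, e.g. the bare ":" keep-alive ping
            if line.startswith("data:"):
                payload = line[len("data:"):].strip()
                if payload == "[DONE]":
                    return  # OpenAI-style end-of-stream sentinel
                yield json.loads(payload)

# Example: a keep-alive comment interleaved with normal chunks.
stream = (
    'data: {"choices": [{"delta": {"content": "Hello"}}]}\n\n'
    ':\n\n'
    'data: {"choices": [{"delta": {"content": " world"}}]}\n\n'
    'data: [DONE]\n\n'
)
for chunk in parse_sse_events(stream):
    print(chunk["choices"][0]["delta"]["content"])
```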
@nv-hwoo is taking a look at this issue and is working on it as a high priority.
@pziecina-nv I've made some adjustments to genai-perf to fix the issue in this PR and confirmed that the same profile export JSON file is now parsed without error. Currently waiting for the internal build job to finish to get the wheel.
Thank you @nv-hwoo for the quick fix.

```
curl -L -o artifacts.zip …
unzip artifacts.zip
```
The PR has been merged to main now. You can also try pip installing the latest genai-perf with the following command:

```
pip install git+https://github.com/triton-inference-server/perf_analyzer.git@main#subdirectory=genai-perf
```
This resolves the issue. Thanks for the quick fix!
Please review the fix necessary in Triton Distributed.
@piotrm-nvidia I don't have the "approve" button but the fix looks good to me. |
I'm observing an issue where the responses received are not compliant with the OpenAI API schema while benchmarking the monolithic setup at higher concurrency.
Steps to reproduce:
```
genai-perf profile -m hf_meta-llama_Llama-3.1-70B-Instruct \
  --tokenizer meta-llama/Llama-3.1-70B-Instruct \
  --service-kind openai \
  --endpoint-type chat \
  --url 127.0.0.1:9992 \
  --streaming \
  --concurrency 64 \
  --num-dataset-entries 100 \
  --warmup-request-count 2 \
  --request-count 320 \
  --synthetic-input-tokens-mean 3000 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 150 \
  --output-tokens-stddev 0 \
  --extra-inputs min_tokens:150 \
  --extra-inputs max_tokens:150 \
  --extra-inputs ignore_eos:true \
  --random-seed 0 \
  --artifact-dir $PWD \
  --profile-export-file profile_export-hf_meta_llama_Llama_3_1_70B_Instruct-openai-RAG-64.json \
  -- --max-threads 64
```
Observe:
genai-perf fails with the following error:
In the profile JSON file (full logs in monolithic_logs.tar.gz) the following entries can be found:
I've analyzed the genai-perf code and found that the direct cause of this error is the first response being ":\n\n", which seems not to be compliant with the OpenAI API protocol. In the server logs (full logs in monolithic_logs.tar.gz) the following entries can be found:
It seems that the ":\n\n" is present in the vLLM engine output, but in the middle of the response. In the perf_analyzer output we can observe this token both in the middle and at the beginning of the response (without being wrapped in a JSON object). Using the latest main code, genai-perf r25.02 release with tritonclient 2.54.0.
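To illustrate the failure mode described above (a hypothetical sketch with made-up chunk contents, not the actual logs): a client that assumes every streamed chunk carries a JSON payload will crash on the bare keep-alive comment when it arrives unwrapped.

```python
# Hypothetical illustration (made-up chunk contents): a parser that
# assumes every SSE chunk is a JSON object fails on the bare ":"
# keep-alive comment that arrives unwrapped at the start of a stream.
import json

chunks = [
    ":",                                               # keep-alive comment, not JSON
    '{"choices": [{"delta": {"content": "Hello"}}]}',  # normal chat completion chunk
]
for chunk in chunks:
    try:
        payload = json.loads(chunk)
        print("parsed:", payload["choices"][0]["delta"]["content"])
    except json.JSONDecodeError as err:
        print(f"failed to parse {chunk!r}: {err}")
```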