Streaming API returns malformed response chunks with unwrapped ':\n\n' during high concurrency testing #264
CC @nv-hwoo looks like we will need those SSE improvements to genai-perf: triton-inference-server/perf_analyzer#176
This was supposed to be fixed in genai-perf. This is a valid SSE response used to keep a stream alive when it hasn't produced output within a timeout.
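For context, here is a minimal sketch of spec-compliant SSE handling (an illustration only, not the genai-perf implementation): per the SSE specification, any line starting with ":" is a comment, so a bare ":\n\n" chunk is a valid keep-alive ping that a parser should skip rather than decode as JSON.

```python
import json

def parse_sse_events(raw_stream: str):
    """Yield decoded JSON payloads from an OpenAI-style SSE stream."""
    # Events are separated by a blank line ("\n\n").
    for event in raw_stream.split("\n\n"):
        for line in event.splitlines():
            if line.startswith(":"):
                continue  # SSE comment, e.g. the bare ":" keep-alive ping
            if line.startswith("data:"):
                payload = line[len("data:"):].strip()
                if payload == "[DONE]":
                    return  # OpenAI-style end-of-stream sentinel
                yield json.loads(payload)

# Example: a keep-alive comment interleaved with normal chunks.
stream = (
    'data: {"choices": [{"delta": {"content": "Hello"}}]}\n\n'
    ':\n\n'
    'data: {"choices": [{"delta": {"content": " world"}}]}\n\n'
    'data: [DONE]\n\n'
)
for chunk in parse_sse_events(stream):
    print(chunk["choices"][0]["delta"]["content"])
```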
@nv-hwoo is taking a look at this issue and is working on it as a high priority.
@pziecina-nv I've made some adjustments to genai-perf to fix the issue in this PR and confirmed that the same profile export JSON file is now parsed without error. Currently waiting for the internal build job to finish to get the wheel.
Thank you @nv-hwoo for the quick fix.

```
curl -L -o artifacts.zip …
unzip artifacts.zip
```
The PR has been merged to main now. You can also try pip installing the latest genai-perf with the following command:

```
pip install git+https://github.com/triton-inference-server/perf_analyzer.git@main#subdirectory=genai-perf
```
This resolves the issue. Thanks for the quick fix!
Please review the fix necessary in Triton Distributed.
@piotrm-nvidia I don't have the "approve" button but the fix looks good to me. |
I'm observing an issue where the responses received are not compliant with the OpenAI API schema while benchmarking the monolithic setup at higher concurrency.
Steps to reproduce:
```
genai-perf profile -m hf_meta-llama_Llama-3.1-70B-Instruct \
  --tokenizer meta-llama/Llama-3.1-70B-Instruct \
  --service-kind openai \
  --endpoint-type chat \
  --url 127.0.0.1:9992 \
  --streaming \
  --concurrency 64 \
  --num-dataset-entries 100 \
  --warmup-request-count 2 \
  --request-count 320 \
  --synthetic-input-tokens-mean 3000 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 150 \
  --output-tokens-stddev 0 \
  --extra-inputs min_tokens:150 \
  --extra-inputs max_tokens:150 \
  --extra-inputs ignore_eos:true \
  --random-seed 0 \
  --artifact-dir $PWD \
  --profile-export-file profile_export-hf_meta_llama_Llama_3_1_70B_Instruct-openai-RAG-64.json \
  -- --max-threads 64
```
Observe:
genai-perf fails with the following error:
In the profile JSON file (full logs in monolithic_logs.tar.gz) the following entries can be found:
I've analyzed the genai-perf code and found that the direct cause of this error is the first response being ":\n\n", which seems not to be compliant with the OpenAI API protocol. In the server logs (full logs in monolithic_logs.tar.gz) the following entries can be found:
It seems that the ":\n\n" is present in the vLLM engine output, but in the middle of the response. In the perf_analyzer output we can observe this token both in the middle and at the beginning of the response (without being wrapped in a JSON object). Using the latest main code, genai-perf r25.02 release with tritonclient 2.54.0.
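To illustrate the failure mode described above (a hypothetical sketch with made-up chunk contents, not the actual logs): a client that assumes every streamed chunk carries a JSON payload will crash on the bare keep-alive comment when it arrives unwrapped.

```python
# Hypothetical illustration (made-up chunk contents): a parser that
# assumes every SSE chunk is a JSON object fails on the bare ":"
# keep-alive comment that arrives unwrapped at the start of a stream.
import json

chunks = [
    ":",                                               # keep-alive comment, not JSON
    '{"choices": [{"delta": {"content": "Hello"}}]}',  # normal chat completion chunk
]
for chunk in chunks:
    try:
        payload = json.loads(chunk)
        print("parsed:", payload["choices"][0]["delta"]["content"])
    except json.JSONDecodeError as err:
        print(f"failed to parse {chunk!r}: {err}")
```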