This repository was archived by the owner on Mar 8, 2025. It is now read-only.

Streaming API returns malformed response chunks with unwrapped ':\n\n' during high concurrency testing #264

Open
pziecina-nv opened this issue Feb 25, 2025 · 9 comments


pziecina-nv commented Feb 25, 2025

I'm observing responses that are not compliant with the OpenAI API schema while benchmarking the monolithic setup at higher concurrency.

Steps to reproduce:

  1. Run server
set -Eexumo pipefail

export SERVICE_HOST=127.0.0.1
export SERVICE_PORT=9992
export SERVED_MODEL_NAME="hf_meta-llama_Llama-3.1-70B-Instruct"

trap 'trap - SIGINT SIGTERM EXIT; kill -term -- -$SERVER_PGID || true' SIGINT SIGTERM EXIT

export VLLM_NO_USAGE_STATS=1
export VLLM_FLASH_ATTN_VERSION=2
export ETCD_ENDPOINTS="http://127.0.0.1:2379"
export RUST_LOG=debug
export TRD_LOG=debug
export VLLM_LOGGING_LEVEL=DEBUG

nats-server -js -p 4222 -m 8222 &
etcd \
    --listen-peer-urls http://127.0.0.1:2380 \
    --listen-client-urls $ETCD_ENDPOINTS \
    --advertise-client-urls $ETCD_ENDPOINTS &
sleep 3
http --host 0.0.0.0 --port 9992 &
sleep 1
llmctl http add chat-models $SERVED_MODEL_NAME triton-init.vllm.generate

server_args="--swap-space 16 --gpu-memory-utilization 0.8 "
cd /workspace/examples/python_rs/llm/vllm
python3 -m monolith.worker \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 1 \
    $server_args \
    --load-format auto \
    --disable-log-stats \
    --disable-log-requests
  2. Run benchmark
genai-perf profile -m hf_meta-llama_Llama-3.1-70B-Instruct \
  --tokenizer meta-llama/Llama-3.1-70B-Instruct \
  --service-kind openai \
  --endpoint-type chat \
  --url 127.0.0.1:9992 \
  --streaming \
  --concurrency 64 \
  --num-dataset-entries 100 \
  --warmup-request-count 2 \
  --request-count 320 \
  --synthetic-input-tokens-mean 3000 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 150 \
  --output-tokens-stddev 0 \
  --extra-inputs min_tokens:150 \
  --extra-inputs max_tokens:150 \
  --extra-inputs ignore_eos:true \
  --random-seed 0 \
  --artifact-dir $PWD \
  --profile-export-file profile_export-hf_meta_llama_Llama_3_1_70B_Instruct-openai-RAG-64.json -- --max-threads 64

  3. Observe

genai-perf fails with the following error:

genai-perf profile -m hf_meta-llama_Llama-3.1-70B-Instruct --tokenizer /jet-artifacts/model/llama3.1_70b_pyt/safetensors_mode-instruct/hf-1d54af3-nim1.2_bf16 --service-kind openai --endpoint-type chat --url 127.0.0.1:9992 --streaming --concurrency 64 --num-dataset-entries 100 --warmup-request-count 2 --request-count 320 --synthetic-input-tokens-mean 3000 --synthetic-input-tokens-stddev 0 --output-tokens-mean 150 --output-tokens-stddev 0 --extra-inputs min_tokens:150 --extra-inputs max_tokens:150 --extra-inputs ignore_eos:true --random-seed 0 --artifact-dir $PWD --profile-export-file profile_export-hf_meta_llama_Llama_3_1_70B_Instruct-openai-RAG-64.json -- --max-threads 64
2025-02-25 04:55 [INFO] genai_perf.parser:1093 - Detected passthrough args: ['--max-threads', '64']
2025-02-25 04:55 [INFO] genai_perf.parser:112 - Profiling these models: hf_meta-llama_Llama-3.1-70B-Instruct
2025-02-25 04:55 [INFO] genai_perf.subcommand.common:225 - Running Perf Analyzer : 'perf_analyzer -m hf_meta-llama_Llama-3.1-70B-Instruct --async --input-data /tmp/genai-perf/inputs.json -i http --concurrency-range 64 --endpoint v1/chat/completions --service-kind openai -u 127.0.0.1:9992 --request-count 320 --warmup-request-count 2 --profile-export-file /tmp/genai-perf/profile_export-hf_meta_llama_Llama_3_1_70B_Instruct-openai-RAG-64.json --measurement-interval 10000 --stability-percentage 999 --max-threads 64'
WARNING: The number of concurrent requests exceeds the warmup request count. Adjusting warmup concurrency to match the warmup request count.
2025-02-25 04:57 [INFO] genai_perf.profile_data_parser.profile_data_parser:64 - Loading response data from '/tmp/genai-perf/profile_export-hf_meta_llama_Llama_3_1_70B_Instruct-openai-RAG-64.json'
2025-02-25 04:57 [INFO] genai_perf.profile_data_parser.llm_profile_data_parser:94 - Parsing 320 requests
Parsing Requests:  16%|██████▎                                 | 50/320 [00:00<00:02, 98.24req/s]
2025-02-25 04:57 [ERROR] genai_perf.utils:93 - Failed to parse JSON string: '[DONE]'
Parsing Requests:  16%|██████▎                                 | 50/320 [00:00<00:02, 95.51req/s]
Traceback (most recent call last):
  File "/jet/llm-benchmarks/.venv/lib/python3.12/site-packages/genai_perf/main.py", line 54, in main
    run()
  File "/jet/llm-benchmarks/.venv/lib/python3.12/site-packages/genai_perf/main.py", line 47, in run
    args.func(args, extra_args)
  File "/jet/llm-benchmarks/.venv/lib/python3.12/site-packages/genai_perf/subcommand/profile.py", line 69, in profile_handler
    data_parser = calculate_metrics(args, tokenizer)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/jet/llm-benchmarks/.venv/lib/python3.12/site-packages/genai_perf/subcommand/common.py", line 83, in calculate_metrics
    return LLMProfileDataParser(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/jet/llm-benchmarks/.venv/lib/python3.12/site-packages/genai_perf/profile_data_parser/llm_profile_data_parser.py", line 80, in __init__
    super().__init__(filename, goodput_constraints)
  File "/jet/llm-benchmarks/.venv/lib/python3.12/site-packages/genai_perf/profile_data_parser/profile_data_parser.py", line 67, in __init__
    self._parse_profile_data(data)
  File "/jet/llm-benchmarks/.venv/lib/python3.12/site-packages/genai_perf/profile_data_parser/profile_data_parser.py", line 132, in _parse_profile_data
    metrics = self._parse_requests(requests)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/jet/llm-benchmarks/.venv/lib/python3.12/site-packages/genai_perf/profile_data_parser/llm_profile_data_parser.py", line 102, in _parse_requests
    self._preprocess_response(res_timestamps, res_outputs)
  File "/jet/llm-benchmarks/.venv/lib/python3.12/site-packages/genai_perf/profile_data_parser/llm_profile_data_parser.py", line 225, in _preprocess_response
    merged_response = load_json_str(remove_sse_prefix(responses[0]))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/jet/llm-benchmarks/.venv/lib/python3.12/site-packages/genai_perf/utils.py", line 90, in load_json_str
    return func(json.loads(json_str))
                ^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None

In the profile JSON file (full logs in monolithic_logs.tar.gz), the following entries can be found:

"response_outputs": [
            {
              "response": ":\n\n"
            },
            {
              "response": "data: {\"id\":\"4e7b009c-1935-48e0-b394-e1e2573560d7\",\"choices\":[{\"index\":0,\"finish_reason\":null,\"delta\":{\"role\":\"assistant\",\"content\":\"\"}}],\"created\":1740469453,\"model\":\"hf_meta-llama_Llama-3.1-70B-Instruct\",\"object\":\"chat.completion.chunk\",\"usage\":null,\"system_fingerprint\":null}\n\n"
            },
            {
              "response": "data: {\"id\":\"4e7b009c-1935-48e0-b394-e1e2573560d7\",\"choices\":[{\"index\":0,\"finish_reason\":null,\"delta\":{\"content\":\"You\"}}],\"created\":1740469453,\"model\":\"hf_meta-llama_Llama-3.1-70B-Instruct\",\"object\":\"chat.completion.chunk\",\"usage\":null,\"system_fingerprint\":null}\n\n"
            },
            {
              "response": "data: {\"id\":\"4e7b009c-1935-48e0-b394-e1e2573560d7\",\"choices\":[{\"index\":0,\"finish_reason\":null,\"delta\":{\"content\":\"'ve\"}}],\"created\":1740469453,\"model\":\"hf_meta-llama_Llama-3.1-70B-Instruct\",\"object\":\"chat.completion.chunk\",\"usage\":null,\"system_fingerprint\":null}\n\n"
            },
            ...
            {
              "response": "data: {\"id\":\"bde83ecc-be07-434f-a54e-e3102dd0f944\",\"choices\":[{\"index\":0,\"finish_reason\":null,\"delta\":{\"content\":\":\\n\\n\"}}],\"created\":1740484823,\"model\":\"hf_meta-llama_Llama-3.1-70B-Instruct\",\"object\":\"chat.completion.chunk\",\"usage\":null,\"system_fingerprint\":null}\n\n"
            },
            ...
            {
              "response": "data: {\"id\":\"4e7b009c-1935-48e0-b394-e1e2573560d7\",\"choices\":[{\"index\":0,\"finish_reason\":null,\"delta\":{\"content\":\" speaker\"}}],\"created\":1740469453,\"model\":\"hf_meta-llama_Llama-3.1-70B-Instruct\",\"object\":\"chat.completion.chunk\",\"usage\":null,\"system_fingerprint\":null}\n\n"
            },
            {
              "response": "data: {\"id\":\"4e7b009c-1935-48e0-b394-e1e2573560d7\",\"choices\":[{\"index\":0,\"finish_reason\":\"length\",\"delta\":{\"content\":\" is\"}}],\"created\":1740469453,\"model\":\"hf_meta-llama_Llama-3.1-70B-Instruct\",\"object\":\"chat.completion.chunk\",\"usage\":null,\"system_fingerprint\":null}\n\n"
            },
            {
              "response": "data: [DONE]\n\n"
            }
          ]

I've analyzed the genai-perf code and found that the direct cause of this error is the first response ":\n\n", which does not appear to be compliant with the OpenAI API protocol. In the server logs (full logs in monolithic_logs.tar.gz), the following entries can be found:

INFO 02-24 21:25:27 engine.py:275] Added request 5acf6662-8705-4b18-a51c-db252dfaaa89.
DEBUG 02-24 21:25:27 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'role': 'assistant', 'content': ''}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:28 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': 'This'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:28 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' is'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:28 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' a'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:28 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' collection'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:29 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' of'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:29 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' Son'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:29 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': 'nets'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:29 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' by'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:30 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' William'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:30 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' Shakespeare'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:30 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ','}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:30 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' specifically'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:30 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' from'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' his'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' Fair'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' Youth'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' sequence'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': '.'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' Here'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': "'s"}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' a'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' brief'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' summary'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' of'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' each'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' son'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': 'net'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ':\n\n'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': '**'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': 'Son'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': 'net'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' '}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': '1'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': '**\n'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': 'The'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' speaker'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' l'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': 'aments'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' that'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' the'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' youth'}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': "'s"}, 'logprobs': None, 'finish_reason': None}]}
DEBUG 02-24 21:25:31 worker.py:69] Generated response: {'id': '5acf6662-8705-4b18-a51c-db252dfaaa89', 'object': 'chat.completion.chunk', 'created': 1740461127, 'model': 'hf_meta-llama_Llama-3.1-70B-Instruct', 'choices': [{'index': 0, 'delta': {'content': ' beauty'}, 'logprobs': None, 'finish_reason': None}]}

It seems that ':\n\n' is present in the vLLM engine output, but only in the middle of the response. In the perf_analyzer output we can observe this token both in the middle of the response (wrapped in a JSON chunk) and at the beginning (not wrapped in any JSON object).
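
To illustrate the failure mode, here is a minimal repro sketch of the client-side parsing step (remove_sse_prefix is the function named in the traceback above; its body below is a simplified stand-in, not the actual genai-perf implementation): stripping the SSE field prefix from the keep-alive chunk leaves a bare ':', which json.loads rejects.

import json

# First chunk recorded in the profile export under high concurrency:
# an SSE comment line used as a keep-alive, not a data event.
chunk = ":\n\n"

def remove_sse_prefix(s: str) -> str:
    # Simplified stand-in for genai-perf's prefix stripping: drop a
    # leading "data:" field name if present.
    s = s.strip()
    return s[len("data:"):].strip() if s.startswith("data:") else s

payload = remove_sse_prefix(chunk)  # -> ":" (no "data:" prefix to strip)
json.loads(payload)                 # raises JSONDecodeError: Expecting value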

Using the latest main code, the genai-perf r25.02 release, and tritonclient 2.54.0.

@rmccorm4
Contributor

CC @nv-hwoo looks like we will need those SSE improvements to genai-perf: triton-inference-server/perf_analyzer#176

@ryanolson
Contributor

This was supposed to be fixed in genai-perf.

This is a valid SSE response, used to keep a stream alive when it hasn't produced output within a timeout.
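
For context: the SSE wire format treats any line beginning with ':' as a comment that clients must ignore, which is why ':\n\n' is a common heartbeat for otherwise idle streams. A minimal sketch of a comment-tolerant client-side filter (illustrative only, not the actual genai-perf or perf_analyzer code):

def iter_data_payloads(raw_chunks):
    """Yield only 'data:' payloads, skipping SSE comments/keep-alives."""
    for chunk in raw_chunks:
        for line in chunk.splitlines():
            if not line or line.startswith(":"):
                continue  # comment line such as ":\n\n" -- ignored per spec
            if line.startswith("data:"):
                yield line[len("data:"):].strip()

# The stream from this issue's profile export reduces to its JSON chunks
# plus the "[DONE]" sentinel:
chunks = [":\n\n", 'data: {"id": "abc"}\n\n', "data: [DONE]\n\n"]
payloads = list(iter_data_payloads(chunks))  # ['{"id": "abc"}', '[DONE]']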

@ganeshku1

@nv-hwoo is taking a look at this issue and is working on it as a high priority.


nv-hwoo commented Feb 26, 2025

@pziecina-nv I've made some adjustments to genai-perf to fix the issue in this PR and confirmed that the same profile export JSON file is now parsed without error. Currently waiting for the internal build job to finish to get the wheel.

@ganeshku1

Thank you @nv-hwoo for the quick fix.
From the Slack thread: https://nvidia.slack.com/archives/C06850J381Y/p1740531812887359?thread_ts=1740508839.290399&cid=C06850J381Y
The build is finished and you can download the wheel with curl:
ACCESS_TOKEN=my_gitlab_token
BUILD_JOB_ID=144829946

curl -L -o artifacts.zip \
  --header "PRIVATE-TOKEN: $ACCESS_TOKEN" \
  https://gitlab-master.nvidia.com/api/v4/projects/129834/jobs/$BUILD_JOB_ID/artifacts

unzip artifacts.zip
pip install artifacts/genai_perf-0.0.11.dev0-py3-none-any.whl


nv-hwoo commented Feb 26, 2025

PR has been merged to main now. You can also try pip installing the latest genai-perf with the following command:

pip install git+https://github.com/triton-inference-server/perf_analyzer.git@main#subdirectory=genai-perf

@pziecina-nv
Author

This resolves the issue. Thanks for the quick fix!

@piotrm-nvidia
Contributor

#276

Please review the fix necessary in Triton Distributed.

@nv-hwoo

nv-hwoo commented Feb 26, 2025

@piotrm-nvidia I don't have the "approve" button but the fix looks good to me.
