
Outputs from monolithic and disaggregated deployments are not the same #248

Open

glos-nv opened this issue Feb 23, 2025 · 0 comments

@glos-nv (Contributor) commented Feb 23, 2025

I'm running my deployments like this:

# step 1: run nats & etcd
nats-server -js --trace &

etcd &
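
Before starting the workers I make sure both services are actually listening; a quick sanity check, assuming the default ports (4222 for NATS, 2379 for etcd):

# optional: verify nats and etcd are up on their default ports
nc -z localhost 4222 && echo "nats is listening"
curl -s http://localhost:2379/health    # a healthy etcd replies {"health":"true",...}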

# step 2: run worker(s)
cd /workspace/examples/python_rs/llm/vllm 
source /opt/triton/venv/bin/activate

python3 -m monolith.worker \
            --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
            --max-model-len 1024 \
            --gpu-memory-utilization 0.8 \
            --enforce-eager \
            --tensor-parallel-size 1 &


# step 3: run api server
TRD_LOG=DEBUG http &
llmctl http add chat-models deepseek-ai/DeepSeek-R1-Distill-Llama-8B triton-init.vllm.generate
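
Before sending requests I check that the model actually got registered; a minimal sketch, assuming the api server also exposes the standard OpenAI-compatible model-listing route (an assumption, not verified for this frontend):

# assumption: the usual OpenAI-compatible listing endpoint is available
curl -s localhost:8080/v1/models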

Then I run the client

# step 4: run http client

curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
      {"role": "system", "content": "What is the capital of France?"}
    ],
    "seed": 1,
    "temperature": 0,
    "top_p": 0.95,
    "max_tokens": 50,
    "min_tokens": 1,
    "n": 1,
    "frequency_penalty": 0.0,
    "stop": []
  }'
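
Because the request pins seed, temperature, and top_p, I'd expect repeated calls against a single deployment to be deterministic; a small sketch (needs jq) that sends the identical body twice and compares the generated text:

# send the same request twice against one deployment and compare the completions
REQ='{"model":"deepseek-ai/DeepSeek-R1-Distill-Llama-8B","messages":[{"role":"system","content":"What is the capital of France?"}],"seed":1,"temperature":0,"top_p":0.95,"max_tokens":50,"min_tokens":1,"n":1,"frequency_penalty":0.0,"stop":[]}'
A=$(curl -s localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d "$REQ" | jq -r '.choices[0].message.content')
B=$(curl -s localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d "$REQ" | jq -r '.choices[0].message.content')
[ "$A" = "$B" ] && echo "deterministic" || echo "nondeterministic"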

For disaggregated serving, I replace step 2 with:

cd /workspace/examples/python_rs/llm/vllm 
source /opt/triton/venv/bin/activate

VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=0 python3 -m disaggregated.prefill_worker \
            --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
            --max-model-len 1024 \
            --gpu-memory-utilization 0.8 \
            --enforce-eager \
            --tensor-parallel-size 1 \
            --kv-transfer-config \
            '{"kv_connector":"TritonNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}' &
      
            
VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=1 python3 -m disaggregated.decode_worker \
        --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
        --max-model-len 1024 \
        --gpu-memory-utilization 0.8 \
        --enforce-eager \
        --tensor-parallel-size 1 \
        --kv-transfer-config \
        '{"kv_connector":"TritonNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}' &

Since the request pins seed=1 and temperature=0, I would expect the same output regardless of the kind of deployment, but the monolithic deployment returns:

{"id":"9dcede72-7092-4dd9-ac7f-d9ae99c5224e","choices":[{"message":{"role":"assistant","content":"Okay, so I need to figure out the capital of France. Hmm, I'm not entirely sure, but I think it's one of the major cities in France. Let me try to recall. I remember that Paris is a big city there,"},"index":0,"finish_reason":"length"}],"created":1740322046,"model":"deepseek-ai/DeepSeek-R1-Distill-Llama-8B","object":"chat.completion","usage":null,"system_fingerprint":null}

while the disaggregated deployment returns:

{"id":"85e5cad2-6495-4dd7-b81b-757e4a564c70___decode_hostname_ipp2-0493___decode_kv_rank_1","choices":[{"message":{"role":"assistant","content":"\n\n</think>\n\nThe capital of France is Paris."},"index":0,"finish_reason":"stop"}],"created":1740322505,"model":"deepseek-ai/DeepSeek-R1-Distill-Llama-8B","object":"chat.completion","usage":null,"system_fingerprint":null}