This repository was archived by the owner on Mar 8, 2025. It is now read-only.

Runtime tests python unit test_sanity fails in request plane with missing _nats_client #229

Open
piotrm-nvidia opened this issue Feb 21, 2025 · 0 comments

piotrm-nvidia commented Feb 21, 2025

The reproduction rate is well below 100%.

Error:

  File "/workspace/icp/python/src/triton_distributed/icp/nats_request_plane.py", line 153, in close
    if self._nats_client:
       ^^^^^^^^^^^^^^^^^
AttributeError: 'NatsRequestPlane' object has no attribute '_nats_client'

https://gitlab-master.nvidia.com/dl/triton/triton-distributed-ci/-/jobs/143500717#L385
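
One possible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: the longer log shows `connect()` being cancelled before `self._nats_client` is assigned, so `close()` then reads an attribute that was never created. The class below is a hypothetical stand-in (`RequestPlaneSketch` is not the real `NatsRequestPlane`, and the sleep replaces the real `nats.connect` call); it only illustrates the assumption that defining `_nats_client = None` in `__init__` would let `close()` run safely after a cancelled connect.

```python
import asyncio


class RequestPlaneSketch:
    """Hypothetical stand-in for NatsRequestPlane; not the real class."""

    def __init__(self, uri):
        self._request_plane_uri = uri
        # Assumption: defining the attribute up front means close() can
        # always read it, even if connect() never finished.
        self._nats_client = None

    async def connect(self):
        # Placeholder for `await nats.connect(...)`; a long sleep gives the
        # caller a chance to cancel before the attribute is assigned.
        await asyncio.sleep(3600)
        self._nats_client = object()

    async def close(self):
        # Same check as in the traceback; with the default above it no
        # longer raises AttributeError when connect() was cancelled.
        if self._nats_client:
            self._nats_client = None
        return True


async def cancel_during_connect():
    plane = RequestPlaneSketch("nats://localhost:4223")
    task = asyncio.create_task(plane.connect())
    await asyncio.sleep(0)  # let connect() start and suspend
    task.cancel()           # mimics the SIGTERM-driven cancellation
    try:
        await task
    except asyncio.CancelledError:
        pass
    return await plane.close()


if __name__ == "__main__":
    print(asyncio.run(cancel_during_connect()))
```

This would only address the `AttributeError` during shutdown; it does not explain the client's exit code 1 or the NATS `ServiceUnavailableError` seen in the log.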

Longer log:

runtime/tests/python/unit/test_logger.py::mypy PASSED                    [100%]
=================================== FAILURES ===================================
_________________________________ test_sanity __________________________________
    def test_sanity():
        deployment_command = [
            "python3",
            "-m",
            "hello_world.deploy",
            "--initialize-request-plane",
        ]
    
        deployment_process = subprocess.Popen(
            deployment_command,
            stdin=subprocess.DEVNULL,
        )
    
        client_command = [
            "python3",
            "-m",
            "hello_world.client",
            "--requests-per-client",
            "10",
        ]
    
        client_process = subprocess.Popen(
            client_command,
            stdin=subprocess.DEVNULL,
        )
        try:
            client_process.wait(timeout=60)
        except subprocess.TimeoutExpired:
            print("Client timed out!")
            client_process.terminate()
            client_process.wait()
    
        client_process.terminate()
        client_process.kill()
        client_process.wait()
        deployment_process.terminate()
        deployment_process.wait()
>       assert client_process.returncode == 0, "Error in clients!"
E       AssertionError: Error in clients!
E       assert 1 == 0
E        +  where 1 = <Popen: returncode: 1 args: ['python3', '-m', 'hello_world.client', '--reque...>.returncode
client_command = ['python3', '-m', 'hello_world.client', '--requests-per-client', '10']
client_process = <Popen: returncode: 1 args: ['python3', '-m', 'hello_world.client', '--reque...>
deployment_command = ['python3', '-m', 'hello_world.deploy', '--initialize-request-plane']
deployment_process = <Popen: returncode: 2 args: ['python3', '-m', 'hello_world.deploy', '--initi...>
examples/python/hello_world/tests/test_sanity.py:84: AssertionError
----------------------------- Captured stdout call -----------------------------
Starting Workers
00:12:45.473 deployment.py:119 [triton_distributed.runtime.deployment] INFO: 
Starting Worker:
	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x7f6a0576a2a0>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='encoder',
                                       implementation=<class 'triton_distributed.runtime.triton_core_operator.TritonCoreOperator'>,
                                       repository='/workspace/examples/python/hello_world/operators/triton_core_models',
                                       version=1,
                                       max_inflight_requests=1,
                                       parameters={'config': {'instance_group': [{'count': 1,
                                                                                  'kind': 'KIND_CPU'}],
                                                              'parameters': {'delay': {'string_value': '0'},
                                                                             'input_copies': {'string_value': '1'}}}},
                                       log_level=None)],
             name='encoder.0',
             log_dir='/workspace/examples/python/hello_world/logs',
             metrics_port=50000)
	<SpawnProcess name='encoder.0' parent=203 initial>
00:12:45.475 deployment.py:119 [triton_distributed.runtime.deployment] INFO: 
Starting Worker:
	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x7f6a0576a2a0>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='decoder',
                                       implementation=<class 'triton_distributed.runtime.triton_core_operator.TritonCoreOperator'>,
                                       repository='/workspace/examples/python/hello_world/operators/triton_core_models',
                                       version=1,
                                       max_inflight_requests=1,
                                       parameters={'config': {'instance_group': [{'count': 1,
                                                                                  'kind': 'KIND_CPU'}],
                                                              'parameters': {'delay': {'string_value': '0'},
                                                                             'input_copies': {'string_value': '1'}}}},
                                       log_level=None)],
             name='decoder.0',
             log_dir='/workspace/examples/python/hello_world/logs',
             metrics_port=50001)
	<SpawnProcess name='decoder.0' parent=203 initial>
00:12:45.476 deployment.py:119 [triton_distributed.runtime.deployment] INFO: 
Starting Worker:
	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x7f6a0576a2a0>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='encoder_decoder',
                                       implementation='EncodeDecodeOperator',
                                       repository='/workspace/examples/python/hello_world/operators',
                                       version=1,
                                       max_inflight_requests=1,
                                       parameters={},
                                       log_level=None)],
             name='encoder_decoder.0',
             log_dir='/workspace/examples/python/hello_world/logs',
             metrics_port=50002)
	<SpawnProcess name='encoder_decoder.0' parent=203 initial>
Workers started ... press Ctrl-C to Exit
00:12:46.003 worker.py:390 [Triton Distributed Runtime] INFO: Starting Worker ==> encoder.0
00:12:46.022 worker.py:390 [Triton Distributed Runtime] INFO: Starting Worker ==> decoder.0
00:12:46.025 worker.py:390 [Triton Distributed Runtime] INFO: Starting Worker ==> encoder_decoder.0
Exception: nats: ServiceUnavailableError: code=503 err_code=10077 description='error opening msg block file ["/tmp/nats_store/jetstream/$G/streams/model-encoder_decoder-1/msgs/2.blk"]: open /tmp/nats_store/jetstream/$G/streams/model-encoder_decoder-1/msgs/2.blk: no such file or directory'
Throughput: 14.732658855481125 Total Time: 0.6787641048431396
Clients Stopped Exit Code 1
00:12:46.243 server.py:83 [uvicorn.error] INFO: Started server process [470]
00:12:46.243 on.py:48 [uvicorn.error] INFO: Waiting for application startup.
00:12:46.244 on.py:62 [uvicorn.error] INFO: Application startup complete.
00:12:46.244 server.py:215 [uvicorn.error] INFO: Uvicorn running on http://127.0.0.1:50000/ (Press CTRL+C to quit)
Stopping Workers
00:12:46.254 deployment.py:132 [triton_distributed.runtime.deployment] INFO: 
Stopping Worker:
	<SpawnProcess name='encoder.0' pid=470 parent=203 started>
00:12:46.254 deployment.py:132 [triton_distributed.runtime.deployment] INFO: 
Stopping Worker:
	<SpawnProcess name='decoder.0' pid=472 parent=203 started>
00:12:46.255 deployment.py:132 [triton_distributed.runtime.deployment] INFO: 
Stopping Worker:
	<SpawnProcess name='encoder_decoder.0' pid=473 parent=203 started>
00:12:46.262 server.py:83 [uvicorn.error] INFO: Started server process [472]
00:12:46.262 on.py:48 [uvicorn.error] INFO: Waiting for application startup.
00:12:46.262 on.py:62 [uvicorn.error] INFO: Application startup complete.
00:12:46.262 worker.py:317 [Triton Distributed Runtime] INFO: Received exit signal SIGTERM...
00:12:46.262 worker.py:329 [Triton Distributed Runtime] ERROR: Failed to close the request plane: 'NatsRequestPlane' object has no attribute '_nats_client'
Traceback (most recent call last):
  File "/workspace/runtime/python/src/triton_distributed/runtime/worker.py", line 327, in shutdown
    await self._request_plane.close()
  File "/workspace/icp/python/src/triton_distributed/icp/nats_request_plane.py", line 153, in close
    if self._nats_client:
       ^^^^^^^^^^^^^^^^^
AttributeError: 'NatsRequestPlane' object has no attribute '_nats_client'
00:12:46.263 worker.py:332 [Triton Distributed Runtime] INFO: Cancelling 3 outstanding tasks
00:12:46.264 server.py:263 [uvicorn.error] INFO: Shutting down
00:12:46.264 worker.py:374 [Triton Distributed Runtime] ERROR: Task failed, msg=Task exception was never retrieved, exception='Server' object has no attribute 'servers'
00:12:46.264 on.py:134 [uvicorn.error] ERROR: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 700, in lifespan
    await receive()
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/message_logger.py", line 60, in inner_receive
    message = await receive()
              ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/queues.py", line 158, in get
    await getter
asyncio.exceptions.CancelledError
00:12:46.264 server.py:83 [uvicorn.error] INFO: Started server process [473]
00:12:46.264 on.py:48 [uvicorn.error] INFO: Waiting for application startup.
00:12:46.264 worker.py:412 [Triton Distributed Runtime] INFO: Worker cancelled!
00:12:46.264 worker.py:384 [Triton Distributed Runtime] INFO: Stopping the event loop
00:12:46.264 on.py:62 [uvicorn.error] INFO: Application startup complete.
00:12:46.264 worker.py:416 [Triton Distributed Runtime] INFO: Successfully shutdown worker.
00:12:46.264 worker.py:317 [Triton Distributed Runtime] INFO: Received exit signal SIGTERM...
00:12:46.265 worker.py:329 [Triton Distributed Runtime] ERROR: Failed to close the request plane: 'NatsRequestPlane' object has no attribute '_nats_client'
Traceback (most recent call last):
  File "/workspace/runtime/python/src/triton_distributed/runtime/worker.py", line 327, in shutdown
    await self._request_plane.close()
  File "/workspace/icp/python/src/triton_distributed/icp/nats_request_plane.py", line 153, in close
    if self._nats_client:
       ^^^^^^^^^^^^^^^^^
AttributeError: 'NatsRequestPlane' object has no attribute '_nats_client'
00:12:46.266 worker.py:332 [Triton Distributed Runtime] INFO: Cancelling 4 outstanding tasks
00:12:46.266 server.py:263 [uvicorn.error] INFO: Shutting down
00:12:46.266 worker.py:374 [Triton Distributed Runtime] ERROR: Task failed, msg=Task exception was never retrieved, exception='Server' object has no attribute 'servers'
00:12:46.266 on.py:134 [uvicorn.error] ERROR: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 700, in lifespan
    await receive()
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/message_logger.py", line 60, in inner_receive
    message = await receive()
              ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/queues.py", line 158, in get
    await getter
asyncio.exceptions.CancelledError
00:12:46.266 worker.py:412 [Triton Distributed Runtime] INFO: Worker cancelled!
00:12:46.266 worker.py:384 [Triton Distributed Runtime] INFO: Stopping the event loop
00:12:46.267 worker.py:416 [Triton Distributed Runtime] INFO: Successfully shutdown worker.
00:12:48.345 worker.py:292 [Triton Distributed Runtime] INFO: Worker started...
00:12:48.345 worker.py:268 [Triton Distributed Runtime] INFO: Starting encoder handler...
00:12:48.346 server.py:263 [uvicorn.error] INFO: Shutting down
00:12:48.346 worker.py:317 [Triton Distributed Runtime] INFO: Received exit signal SIGTERM...
00:12:48.347 worker.py:332 [Triton Distributed Runtime] INFO: Cancelling 5 outstanding tasks
00:12:50.347 server.py:263 [uvicorn.error] INFO: Shutting down
00:12:50.349 on.py:134 [uvicorn.error] ERROR: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 700, in lifespan
    await receive()
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/message_logger.py", line 60, in inner_receive
    message = await receive()
              ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/queues.py", line 158, in get
    await getter
asyncio.exceptions.CancelledError
00:12:50.349 worker.py:412 [Triton Distributed Runtime] INFO: Worker cancelled!
00:12:50.349 worker.py:300 [Triton Distributed Runtime] INFO: worker store: []
00:12:50.349 worker.py:301 [Triton Distributed Runtime] INFO: Worker stopped...
00:12:50.349 worker.py:302 [Triton Distributed Runtime] INFO: Hosted Operators: {('encoder', 1): <triton_distributed.runtime.triton_core_operator.TritonCoreOperator object at 0x7f0edc33ec00>} Requests Received: Counter() Responses Sent: Counter()
00:12:50.349 server.py:263 [uvicorn.error] INFO: Shutting down
00:12:50.449 worker.py:384 [Triton Distributed Runtime] INFO: Stopping the event loop
00:12:50.450 worker.py:416 [Triton Distributed Runtime] INFO: Successfully shutdown worker.
00:12:51.295 deployment.py:141 [triton_distributed.runtime.deployment] INFO: 
Worker Stopped:
	<SpawnProcess name='encoder.0' pid=470 parent=203 stopped exitcode=0>
00:12:51.295 deployment.py:141 [triton_distributed.runtime.deployment] INFO: 
Worker Stopped:
	<SpawnProcess name='decoder.0' pid=472 parent=203 stopped exitcode=1>
00:12:51.295 deployment.py:141 [triton_distributed.runtime.deployment] INFO: 
Worker Stopped:
	<SpawnProcess name='encoder_decoder.0' pid=473 parent=203 stopped exitcode=1>
Workers Stopped Exit Code 2
----------------------------- Captured stderr call -----------------------------
Process decoder.0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/runtime/python/src/triton_distributed/runtime/deployment.py", line 65, in _start_worker
    Worker(worker_config).start()
  File "/workspace/runtime/python/src/triton_distributed/runtime/worker.py", line 418, in start
    exit_condition = serve_result.result()
                     ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/runtime/python/src/triton_distributed/runtime/worker.py", line 275, in serve
    await self._request_plane.connect()
  File "/workspace/icp/python/src/triton_distributed/icp/nats_request_plane.py", line 260, in connect
    self._nats_client = await nats.connect(self._request_plane_uri)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/nats/__init__.py", line 45, in connect
    await nc.connect(servers, **options)
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 503, in connect
    await self._select_next_server()
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 1353, in _select_next_server
    await self._transport.connect(
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/transport.py", line 121, in connect
    r, w = await asyncio.wait_for(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/streams.py", line 48, in open_connection
    transport, _ = await loop.create_connection(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1104, in create_connection
    sock = await self._connect_sock(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1007, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/lib/python3.12/asyncio/selector_events.py", line 651, in sock_connect
    return await fut
           ^^^^^^^^^
asyncio.exceptions.CancelledError
Process encoder_decoder.0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/runtime/python/src/triton_distributed/runtime/deployment.py", line 65, in _start_worker
    Worker(worker_config).start()
  File "/workspace/runtime/python/src/triton_distributed/runtime/worker.py", line 418, in start
    exit_condition = serve_result.result()
                     ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/runtime/python/src/triton_distributed/runtime/worker.py", line 275, in serve
    await self._request_plane.connect()
  File "/workspace/icp/python/src/triton_distributed/icp/nats_request_plane.py", line 260, in connect
    self._nats_client = await nats.connect(self._request_plane_uri)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/nats/__init__.py", line 45, in connect
    await nc.connect(servers, **options)
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 503, in connect
    await self._select_next_server()
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 1353, in _select_next_server
    await self._transport.connect(
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/transport.py", line 121, in connect
    r, w = await asyncio.wait_for(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/streams.py", line 48, in open_connection
    transport, _ = await loop.create_connection(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1104, in create_connection
    sock = await self._connect_sock(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1007, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/lib/python3.12/asyncio/selector_events.py", line 651, in sock_connect
    return await fut
           ^^^^^^^^^
asyncio.exceptions.CancelledError
    self._target(*self._args, **self._kwargs)
  File "/workspace/runtime/python/src/triton_distributed/runtime/deployment.py", line 65, in _start_worker
    Worker(worker_config).start()
  File "/workspace/runtime/python/src/triton_distributed/runtime/worker.py", line 418, in start
    exit_condition = serve_result.result()
                     ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/runtime/python/src/triton_distributed/runtime/worker.py", line 275, in serve
    await self._request_plane.connect()
  File "/workspace/icp/python/src/triton_distributed/icp/nats_request_plane.py", line 260, in connect
    self._nats_client = await nats.connect(self._request_plane_uri)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/nats/__init__.py", line 45, in connect
    await nc.connect(servers, **options)
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 503, in connect
    await self._select_next_server()
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 1353, in _select_next_server
    await self._transport.connect(
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/transport.py", line 121, in connect
    r, w = await asyncio.wait_for(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/streams.py", line 48, in open_connection
    transport, _ = await loop.create_connection(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1104, in create_connection
    sock = await self._connect_sock(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1007, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/lib/python3.12/asyncio/selector_events.py", line 651, in sock_connect
    return await fut
           ^^^^^^^^^
asyncio.exceptions.CancelledError
Process encoder_decoder.0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/runtime/python/src/triton_distributed/runtime/deployment.py", line 65, in _start_worker
    Worker(worker_config).start()
  File "/workspace/runtime/python/src/triton_distributed/runtime/worker.py", line 418, in start
    exit_condition = serve_result.result()
                     ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/runtime/python/src/triton_distributed/runtime/worker.py", line 275, in serve
    await self._request_plane.connect()
  File "/workspace/icp/python/src/triton_distributed/icp/nats_request_plane.py", line 260, in connect
    self._nats_client = await nats.connect(self._request_plane_uri)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/nats/__init__.py", line 45, in connect
    await nc.connect(servers, **options)
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 503, in connect
    await self._select_next_server()
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 1353, in _select_next_server
    await self._transport.connect(
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/transport.py", line 121, in connect
    r, w = await asyncio.wait_for(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/streams.py", line 48, in open_connection
    transport, _ = await loop.create_connection(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1104, in create_connection
    sock = await self._connect_sock(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1007, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/lib/python3.12/asyncio/selector_events.py", line 651, in sock_connect
    return await fut
           ^^^^^^^^^
asyncio.exceptions.CancelledError
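
A possible reading of the log above, not verified against the source of `nats_request_plane.py`: `connect()` is cancelled (SIGTERM arrives while `nats.connect(...)` is still pending), so the assignment to `self._nats_client` never runs, and the later `close()` then reads an attribute that was never created. The sketch below reproduces that pattern with stand-in classes. `FragilePlane`, `SaferPlane`, `_client`, and the helper functions are hypothetical names used only for illustration; they are not the project's code, and `asyncio.sleep` stands in for the real network connect.

```python
import asyncio


class FragilePlane:
    """Hypothetical stand-in: the client attribute is created only in connect()."""

    async def connect(self):
        # Stands in for a slow `nats.connect(...)`. If the task is cancelled
        # while waiting here, the assignment below never executes.
        await asyncio.sleep(10)
        self._client = object()

    async def close(self):
        # Raises AttributeError when connect() never finished.
        if self._client:
            self._client = None


class SaferPlane(FragilePlane):
    """Same behaviour, but the attribute always exists after construction."""

    def __init__(self):
        self._client = None


async def cancel_connect_then_close(plane):
    # Start connecting, then cancel before the connection completes,
    # roughly what a shutdown signal during startup would do.
    task = asyncio.create_task(plane.connect())
    await asyncio.sleep(0)  # give connect() a chance to start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass

    try:
        await plane.close()
    except AttributeError as exc:
        return type(exc).__name__
    return "closed cleanly"


def run(plane):
    return asyncio.run(cancel_connect_then_close(plane))


if __name__ == "__main__":
    print("fragile:", run(FragilePlane()))
    print("safer:  ", run(SaferPlane()))
```

If this reading is right, initialising the attribute to `None` in `__init__` (or guarding with `getattr(self, "_nats_client", None)` in `close()`) would remove the `AttributeError`, although the workers would still exit non-zero because of the cancelled connect.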
