I encountered a runtime hang while running a compiled embedding model on multiple devices in my T3000 environment. The model was compiled using Forge, and when executing it with a multi-device topology setup, Device 0 fails to initialize due to a timeout error. The logs indicate that the issue occurs while waiting for the Ethernet cores to finish.
...
Metal | INFO | While initializing Device 0, ethernet tunneler core (x=18,y=17) on Device 0 detected as still running, issuing exit signal.
Always | FATAL | Device 0: Timeout (10000 ms) waiting for physical cores to finish: (x=22,y=25), (x=23,y=25), (x=20,y=25), (x=21,y=25).
Always | FATAL | Device 0 init: failed to initialize FW! Try resetting the board.
...
I know that the Forge runtime currently supports only a single device, as mentioned in this comment.
However, instead of failing gracefully, the Forge runtime hangs, which leads tt-smi to automatically trigger a board reset. Until multi-device is supported, it might be preferable to raise an explicit exception when the Forge runtime is run on multiple devices with a topology setup, so that the hang and the board reset are avoided.
Also, are there any plans to support multi-device execution?
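To illustrate the kind of explicit exception suggested above, here is a minimal sketch of a fail-fast guard. It is not based on the actual Forge runtime code: `MultiDeviceNotSupportedError`, `check_single_device`, and the `requested_device_ids` argument are hypothetical names used only for illustration, and where such a check would actually be called inside the runtime is an assumption.

```python
class MultiDeviceNotSupportedError(RuntimeError):
    """Hypothetical exception for runs that target more than one device."""


def check_single_device(requested_device_ids):
    """Fail fast if a run targets more than one device.

    `requested_device_ids` is a hypothetical iterable of device IDs taken
    from the topology setup; it is not an actual Forge runtime parameter.
    """
    unique_ids = sorted(set(requested_device_ids))
    if len(unique_ids) > 1:
        raise MultiDeviceNotSupportedError(
            "Multi-device execution is not supported yet; "
            f"got devices {unique_ids}. Run on a single device instead."
        )
    return unique_ids


# Example: an 8-device topology is rejected before any device is initialized.
try:
    check_single_device(range(8))
except MultiDeviceNotSupportedError as err:
    print(f"Rejected: {err}")
```

If a check like this ran before device initialization, the user would get a clear error message instead of the Ethernet core timeout and the subsequent board reset.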