Timeout Error during runtime Initialization on N300 with multiple devices #1135

Open
sott0n opened this issue Jan 30, 2025 · 0 comments

sott0n commented Jan 30, 2025

I encountered a runtime hang while running a compiled embedding model on multiple devices in my T3000 environment. The model was compiled using Forge, and when executing it with a multi-device topology setup, Device 0 fails to initialize due to a timeout error. The logs indicate that the issue occurs while waiting for the Ethernet cores to finish.

Environment

Error log

...
                  Metal | INFO     | While initializing Device 0, ethernet tunneler core (x=18,y=17) on Device 0 detected as still running, issuing exit signal.
                 Always | FATAL    | Device 0: Timeout (10000 ms) waiting for physical cores to finish: (x=22,y=25), (x=23,y=25), (x=20,y=25), (x=21,y=25).
                 Always | FATAL    | Device 0 init: failed to initialize FW! Try resetting the board.
...

I know that the Forge runtime currently supports only a single device, as mentioned in this comment.
However, instead of failing gracefully, the Forge runtime hangs, which leads tt-smi to automatically trigger a board reset. Until multi-device is supported, it would be preferable for the Forge runtime to raise an explicit exception when it is run on multiple devices with a topology setup.
Also, are there any plans to support multi-device execution?
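
To make the request concrete, here is a rough sketch of the kind of fail-fast guard I have in mind. It is not taken from the Forge codebase: `MultiDeviceNotSupportedError`, `ensure_single_device`, and the `device_ids` argument are all hypothetical names, and where such a check would actually live in the runtime is an assumption on my part.

```python
class MultiDeviceNotSupportedError(RuntimeError):
    """Hypothetical exception type: more than one device was requested."""


def ensure_single_device(device_ids):
    """Reject multi-device requests before any device initialization starts.

    `device_ids` is assumed to be an iterable of the device IDs the caller
    wants to open. Returns the single device ID when the request is valid.
    """
    ids = list(device_ids)
    if not ids:
        raise ValueError("no device requested")
    if len(ids) > 1:
        # Raise immediately instead of letting initialization hang and time out.
        raise MultiDeviceNotSupportedError(
            f"Forge runtime currently supports a single device, got {len(ids)}: {ids}"
        )
    return ids[0]
```

Calling something like `ensure_single_device([0, 1])` before the runtime touches the hardware would surface a clear error to the user rather than a 10000 ms timeout followed by a board reset.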
