Fix Worker <-> fabric EDM connection teardown race #16634

Open
Tracked by #16632
SeanNijjar opened this issue Jan 10, 2025 · 0 comments · May be fixed by #17033

SeanNijjar commented Jan 10, 2025

A recently discovered race condition can occasionally cause hangs. It occurs during connection teardown between workers and fabric EDM endpoints.

A bandaid has been created in this PR: #16636

The general problem is aliasing between flow control and teardown: the worker uses a single semaphore for both, but it should have two. As a result, a double increment from EDM -> worker is currently possible.
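
One way to picture the aliasing is with a host-side toy model. This is only a sketch: the struct and function names are hypothetical, plain counters stand in for device semaphores, and it does not reproduce the actual EDM protocol. It assumes the worker treats any nonzero value as a teardown acknowledgement.

```cpp
#include <cstdint>

// Hypothetical toy model, not tt-metal code.
// One counter receives both kinds of EDM -> worker increments.
struct SharedSemaphoreWorker {
    std::uint32_t sem = 0;
    void on_edm_credit() { ++sem; }        // flow-control signal
    void on_edm_teardown_ack() { ++sem; }  // teardown signal, same counter
    // The worker cannot tell which kind of increment it observed.
    bool teardown_acked() const { return sem != 0; }
};

// Same events, but each purpose has its own counter.
struct SplitSemaphoreWorker {
    std::uint32_t flow_control_sem = 0;
    std::uint32_t teardown_sem = 0;
    void on_edm_credit() { ++flow_control_sem; }
    void on_edm_teardown_ack() { ++teardown_sem; }
    bool teardown_acked() const { return teardown_sem != 0; }
};

// Returns true if a lone flow-control credit is enough for the worker
// to believe that teardown has been acknowledged.
template <typename Worker>
bool credit_mistaken_for_ack() {
    Worker w;
    w.on_edm_credit();
    return w.teardown_acked();
}
```

In this model, credit_mistaken_for_ack<SharedSemaphoreWorker>() returns true and the split variant returns false. With the shared counter, a credit followed by a teardown ack also leaves the value at 2, which is the double increment described above.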


There are likely a few ways to address this problem, but the simplest is probably to add a new semaphore on the worker side and store it in ttnn/cpp/ttnn/operations/ccl/kernels/edm_fabric/edm_fabric_worker_adapters.hpp.

  • Device side can be updated by extending WorkerToFabricEdmSender::build_from_args to read the new semaphore from the runtime args.
  • Host side can be updated by extending append_worker_to_fabric_edm_sender_rt_args in ttnn/cpp/ttnn/operations/ccl/erisc_datamover_builder.cpp to pass the semaphore, and by extending FabricEriscDatamoverBuilder::build to create it.
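
The shape of that host/device change can be sketched with a toy model. Everything here is hypothetical: ToySenderArgs, append_toy_sender_rt_args and build_toy_sender_from_args are stand-ins invented for illustration, not the real signatures of append_worker_to_fabric_edm_sender_rt_args or WorkerToFabricEdmSender::build_from_args, and a std::vector stands in for the kernel's runtime-arg storage. The point is only that the new semaphore has to be appended on the host and read on the device in the same position.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical argument bundle, not the real tt-metal type.
struct ToySenderArgs {
    std::uint32_t flow_control_semaphore_id;
    std::uint32_t teardown_semaphore_id;  // the proposed new semaphore
};

// "Host side": append the arguments in a fixed order.
void append_toy_sender_rt_args(const ToySenderArgs& args,
                               std::vector<std::uint32_t>& rt_args) {
    rt_args.push_back(args.flow_control_semaphore_id);
    rt_args.push_back(args.teardown_semaphore_id);
}

// "Device side": read the arguments back in the same order,
// advancing arg_idx so later readers continue after this block.
ToySenderArgs build_toy_sender_from_args(const std::vector<std::uint32_t>& rt_args,
                                         std::size_t& arg_idx) {
    ToySenderArgs args{};
    args.flow_control_semaphore_id = rt_args.at(arg_idx++);
    args.teardown_semaphore_id = rt_args.at(arg_idx++);
    return args;
}
```

If the host appends the extra argument but the device-side reader is not extended (or the reverse), every argument that follows is read from the wrong slot, so the two sides would need to change together.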