A race condition was recently discovered that can sometimes lead to hangs. This race condition occurs during connection teardown between workers and fabric EDM endpoints.
The general problem is that flow control and teardown are aliased on the worker side: the worker has only one semaphore for both, but it should have two. At the moment, a double increment from EDM -> worker is possible.
There are likely a few options to address this problem, but the simplest is probably to add a new semaphore on the worker side and store it in `ttnn/cpp/ttnn/operations/ccl/kernels/edm_fabric/edm_fabric_worker_adapters.hpp`.
The device side can be updated by extending `WorkerToFabricEdmSender::build_from_args` to read the new semaphore in from the runtime args.
The host side can be updated by extending `append_worker_to_fabric_edm_sender_rt_args` in `ttnn/cpp/ttnn/operations/ccl/erisc_datamover_builder.cpp` to pass the semaphore, and by extending `FabricEriscDatamoverBuilder::build` to create it.
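To illustrate the aliasing and the proposed split, here is a minimal host-side sketch. It is not the actual worker adapter code: `AliasedWorker`, `SplitWorker`, their members, and `teardown_seen_after_late_credit` are hypothetical names, and the sketch assumes the worker treats exactly one increment as the teardown acknowledgement.

```cpp
#include <cstdint>

// Hypothetical model of the current behaviour: a single worker-side
// semaphore receives both flow-control credits and the teardown ack.
struct AliasedWorker {
    std::uint32_t shared_sem = 0;

    void on_flow_control_credit() { ++shared_sem; }  // EDM -> worker
    void on_teardown_ack() { ++shared_sem; }         // EDM -> worker

    // Assumption: the worker waits for exactly one increment during teardown.
    bool teardown_acked() const { return shared_sem == 1; }
};

// Hypothetical model of the proposed fix: teardown gets its own semaphore,
// so flow-control credits can no longer disturb the teardown handshake.
struct SplitWorker {
    std::uint32_t flow_control_sem = 0;
    std::uint32_t teardown_sem = 0;

    void on_flow_control_credit() { ++flow_control_sem; }
    void on_teardown_ack() { ++teardown_sem; }

    bool teardown_acked() const { return teardown_sem == 1; }
};

// Replays the problematic ordering: a flow-control credit lands during
// teardown, followed by the teardown ack. Returns whether the worker
// would observe the teardown ack it is waiting for.
template <typename Worker>
bool teardown_seen_after_late_credit() {
    Worker w;
    w.on_flow_control_credit();
    w.on_teardown_ack();
    return w.teardown_acked();
}
```

In this model `teardown_seen_after_late_credit<AliasedWorker>()` returns `false` because the shared counter has been incremented twice, while the `SplitWorker` variant returns `true`. The real hang may manifest differently on device, but the separation of the two signals is the same idea.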
A bandaid has been created in this PR: #16636