Multi MeshCQ and MeshEvents API Bringup #17582

Merged: 1 commit into main from asaigal/mesh_event, Feb 7, 2025

Conversation

tt-asaigal (Contributor):

Ticket

No Ticket.

Problem description

  • Multi-MeshCQ handling and MeshEvent synchronization APIs need to be added to the TT-Mesh layer and exposed to users through the distributed header.
  • This allows users to parallelize the dispatch of data-movement and control operations to a Virtual Mesh.
  • Additionally, this brings TT-Mesh APIs to parity with core TT-Metal APIs (except for Trace, which is a performance feature), which allows the MeshCommandQueue to function independently of the HardwareCommandQueue.

What's changed

  • Add the MeshEvent class to mesh_event.hpp with associated APIs to distributed.hpp. The implementation for each API is present in mesh_command_queue.cpp.
  • Allow a MeshDevice to be initialized with 2 MeshCommandQueues (see the usage sketch after this list). As is the case with the single-device setup, a MeshDevice must use ethernet dispatch on N300 and T3K systems when exposing multiple command queues.
  • Move command assembly for EnqueueRecordEvent and EnqueueWaitForEvent to a shared header, which allows logic to be reused in the MeshCommandQueue.
  • Completely remove the use of HardwareCommandQueue from MeshCommandQueue, as well as any bookkeeping done to keep both data-structures in sync. The MeshCommandQueue now interfaces directly with the SystemMemoryManager to issue all commands to the Virtual Mesh.
  • Write a custom implementation for MeshCommandQueue::finish() which relies on MeshCommandQueue::drain_events_from_completion_queue(), since the current implementation is entirely single threaded.
  • Add a get_dispatch_core() query API to dispatch_query_manager.
  • Add tests for MeshEvents.
  • Unrelated to MeshEvent: minor modifications for sending go signals to physical devices not involved in a MeshWorkload. This logic now accounts for SubDevices.
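
To make the new surface concrete, here is a minimal usage sketch. It assumes a mesh_device opened with two MeshCommandQueues, a MeshBuffer buf, and host vectors src_vec/dst_vec; the event API argument lists are inferred from the declarations and tests quoted in the review threads below, so treat it as illustrative rather than authoritative.

// Sketch: overlap data movement across the two MeshCQs using MeshEvents.
auto& cq0 = mesh_device->mesh_command_queue(0);   // producer CQ
auto& cq1 = mesh_device->mesh_command_queue(1);   // consumer CQ

auto event = std::make_shared<MeshEvent>();
EnqueueWriteMeshBuffer(cq0, buf, src_vec);        // write on CQ 0
EnqueueRecordEvent(cq0, event);                   // CQ 0 records completion
EnqueueWaitForEvent(cq1, event);                  // CQ 1 stalls until the event fires
ReadShard(cq1, dst_vec, buf, Coordinate(0, 0));   // read shard (0, 0) back on CQ 1
cq0.finish();                                     // host-side barrier per CQ
cq1.finish();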

Resolved review thread: tests/tt_metal/tt_metal/common/multi_device_fixture.hpp (outdated)
@@ -236,7 +236,7 @@ class MeshDevice : public IDevice, public std::enable_shared_from_this<MeshDevice>

 // These methods will get removed once in favour of the ones in IDevice* and TT-Mesh bringup
 // These are prefixed with "mesh_" to avoid conflicts with the IDevice* methods
-MeshCommandQueue& mesh_command_queue();
+MeshCommandQueue& mesh_command_queue(std::size_t cq_id = 0) const;
Contributor:

@ayerofieiev-tt is making a change to make this type strong; I think we should do the same for MeshQueueId upfront?

tt-asaigal (author):

I don't think a strongly typed MeshCommandQueueId object is different from a CommandQueueId object. I'm happy adding something like this (which really should be shared between TT-Mesh and TT-Metal):

class CommandQueueId {
public:
    explicit constexpr CommandQueueId(std::size_t id) : id_(id) {}
    constexpr operator std::size_t() const { return id_; }

    constexpr std::size_t value() const { return id_; }
    constexpr bool operator==(const CommandQueueId& other) const { return id_ == other.id_; }
    constexpr bool operator!=(const CommandQueueId& other) const { return !(*this == other); }

private:
    std::size_t id_;
};

for the purposes of this PR, but I don't want to clobber any of the work Artem is doing. I think it makes sense to consolidate Artem's changes into the MeshCommandQueue once they're on main.

Contributor:

The strongly typed object here is so that we don't interop with single-device CQ, and instead explicitly work with the mesh variant. We don't want the interop, right?

Also, we have a StrongType wrapper, so defining it is just a matter of: using MeshQueueId = tt::stl::StrongType<uint32_t, struct MeshQueueIdTag>;
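
For readers unfamiliar with the pattern, a strong-typedef wrapper is typically along these lines (a generic sketch, not the actual tt::stl::StrongType implementation):

// Generic strong-typedef sketch; the Tag parameter makes each alias a distinct type.
template <typename T, typename Tag>
class StrongType {
public:
    explicit constexpr StrongType(T v) : value_(v) {}
    constexpr const T& operator*() const { return value_; }  // explicit unwrap only
    constexpr bool operator==(const StrongType& other) const { return value_ == other.value_; }

private:
    T value_;
};

using MeshQueueId = StrongType<uint32_t, struct MeshQueueIdTag>;
// MeshQueueId id{1};    // OK: explicit construction
// uint32_t raw = id;    // compile error: no implicit conversion back to uint32_t

Unlike the CommandQueueId draft above, this deliberately omits the implicit conversion operator, so a MeshQueueId cannot silently decay into a plain index accepted by single-device CQ APIs.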

Resolved review thread: tt_metal/api/tt-metalium/mesh_event.hpp (outdated)
#include "mesh_device.hpp"

namespace tt::tt_metal::distributed {
using LogicalDeviceRange = CoreRange;
Contributor:

I think this should live elsewhere... Also can you add a TODO for me to switch this over to a typed DeviceRange? This is the issue #17477

tt-asaigal (author):

Moved to mesh_device_view.hpp, where coordinate systems are currently defined, with a TODO.

Collaborator:

Can we remove this here in favor of the other one?

tt::stl::Span<const SubDeviceId> sub_device_ids = {},
const std::optional<LogicalDeviceRange>& device_range = std::nullopt);

void EnqueueRecordEventToHost(
Contributor:

According to the distributed spec, this notifies all receivers, including host and other devices? Maybe a pair of these would make it clearer:

// Notifies all receivers, including the host, on event completion.
EnqueueRecordEvent(...);

// Notifies all receivers on the device-local CQ.
EnqueueRecordLocalEvent(...);

tt-asaigal (author):

The spec is unclear when it mentions: "EnqueueRecordEventToHost: Have a CQ notify all receivers (including Host) of event completion."

We don't have device-to-device event notifications today: a device either records an event locally or sends it back to host. I was trying to differentiate the two by explicitly informing the user that EnqueueRecordEventToHost will write an event to host, which is a heavier task than recording it locally.
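
For example, the distinction could be captured directly in the header, roughly like this (a sketch: the first two parameters are inferred from the test code in this PR, the trailing ones from the declaration quoted above):

// Records an event on the MeshCQ only. Other MeshCQs can wait on it, but no
// completion-queue entry is written back to host.
void EnqueueRecordEvent(
    MeshCommandQueue& mesh_cq,
    const std::shared_ptr<MeshEvent>& event,
    tt::stl::Span<const SubDeviceId> sub_device_ids = {},
    const std::optional<LogicalDeviceRange>& device_range = std::nullopt);

// Records an event and additionally writes it to the host completion queue so the
// host can synchronize on it. Heavier than recording the event locally.
void EnqueueRecordEventToHost(
    MeshCommandQueue& mesh_cq,
    const std::shared_ptr<MeshEvent>& event,
    tt::stl::Span<const SubDeviceId> sub_device_ids = {},
    const std::optional<LogicalDeviceRange>& device_range = std::nullopt);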

Collaborator:

This is helpful info that would be good to add as comments on the APIs

Collaborator:

Can we document some of the other APIs as well?

tt-asaigal (author):

Yes, I'll document all TT-Mesh APIs, similar to what we do with host_api.hpp, once functional parity is achieved.

Resolved review thread: tt_metal/distributed/mesh_command_queue.cpp (outdated)
Resolved review thread: tt_metal/impl/dispatch/dispatch_query_manager.cpp (outdated)
Resolved review thread: tt_metal/impl/event/dispatch.cpp (outdated)
@tt-asaigal force-pushed the asaigal/mesh_event branch 2 times, most recently from 770d726 to 59c0606 on February 6, 2025 at 22:16
dispatch_core_placement_t& assignment = this->dispatch_core_assignments[device_id][channel][cq_id];
return assignment.dispatcher_d.has_value();
}

const tt_cxy_pair& dispatch_core_manager::dispatcher_d_core(chip_id_t device_id, uint16_t channel, uint8_t cq_id) {
Contributor:

Just to clarify: this method checks if the core is allocated and, if not, allocates it? The API looks as if it is just a getter.

Contributor:

Perhaps two methods would be cleaner: one that returns an optional<tt_cxy_pair>, and a second one that explicitly allocates the core.

tt-asaigal (author):

I agree, we need explicit behaviour here. For the accessor/modifier I added to this PR, I'm following the convention used for all other queries. I think we should have separate work for cleaning up these APIs in general.
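
A minimal sketch of the suggested split, with hypothetical names that are not part of this PR:

// Pure query: returns the dispatcher_d core if one has been assigned; never allocates.
std::optional<tt_cxy_pair> dispatcher_d_core_if_assigned(
    chip_id_t device_id, uint16_t channel, uint8_t cq_id) const;

// Explicit modifier: assigns a dispatcher_d core if none exists, then returns it.
const tt_cxy_pair& assign_dispatcher_d_core(
    chip_id_t device_id, uint16_t channel, uint8_t cq_id);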

#include "mesh_device.hpp"

namespace tt::tt_metal::distributed {
using LogicalDeviceRange = CoreRange;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can remove this here in favor of the other one?

Comment on lines +49 to +56
for (std::size_t logical_x = 0; logical_x < buf->device()->num_cols(); logical_x++) {
for (std::size_t logical_y = 0; logical_y < buf->device()->num_rows(); logical_y++) {
readback_vecs.push_back({});
auto shard = buf->get_device_buffer(Coordinate(logical_y, logical_x));
ReadShard(
mesh_device_->mesh_command_queue(1), readback_vecs.back(), buf, Coordinate(logical_y, logical_x));
}
}
Collaborator:

@omilyutin-tt does it make sense to add some logic to EnqueueReadMeshBuffer for the replicated path so we can clean up some of this scaffolding?

    TT_FATAL(
        buffer->global_layout() == MeshBufferLayout::SHARDED, "Can only read a Sharded MeshBuffer from a MeshDevice.");

Contributor:

I think "replicated" and "sharded" should be property of the write API, not the buffer itself. Is it possible to mutate the data on each shard after the fact (so you replicate initial data, mutate it, then read back individual shards)? Let's chat on this separately, I think we can come up with a much cleaner model for this.

Collaborator:

Yeah, agreed. We can make this cleaner and it'll help our own testing.
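
Until then, a hypothetical helper along these lines could replace the per-shard scaffolding quoted above (names are illustrative; it wraps the same ReadShard loop):

// Reads back every shard of a MeshBuffer, one host vector per logical device.
std::vector<std::vector<uint32_t>> read_all_shards(
    MeshCommandQueue& cq, const std::shared_ptr<MeshBuffer>& buf) {
    std::vector<std::vector<uint32_t>> readback_vecs;
    for (std::size_t logical_x = 0; logical_x < buf->device()->num_cols(); logical_x++) {
        for (std::size_t logical_y = 0; logical_y < buf->device()->num_rows(); logical_y++) {
            readback_vecs.push_back({});
            ReadShard(cq, readback_vecs.back(), buf, Coordinate(logical_y, logical_x));
        }
    }
    return readback_vecs;
}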

std::vector<std::vector<uint32_t>> readback_vecs = {};
std::shared_ptr<MeshEvent> event = std::make_shared<MeshEvent>();
// Writes on CQ 0
EnqueueWriteMeshBuffer(mesh_device_->mesh_command_queue(0), buf, src_vec);
Collaborator:

@omilyutin-tt for sharding, we have a way of specifying a subset of devices. We don't have similar expressiveness for replication. Is there something we can add for our metal testing?

Contributor:

Ack, let's chat on this offline



Resolved review thread: tests/tt_metal/distributed/test_mesh_events.cpp (outdated)
@omilyutin-tt (Contributor) left a review:

Some minor comments left, thanks!

Resolved review thread: tt_metal/impl/dispatch/dispatch_query_manager.cpp (outdated)
Resolved review thread: tt_metal/impl/event/dispatch.cpp (outdated)
Resolved review thread: tt_metal/impl/event/dispatch.cpp (outdated)
Resolved review thread: tt_metal/impl/event/dispatch.cpp



Resolved review threads: tt_metal/distributed/mesh_command_queue.cpp (3 threads)
Commit message:
 - Natively support Host <-> MeshCQ and MeshCQ <-> MeshCQ synchronization in TT-Mesh
 - Enable users to access up to 2 MeshCQs through MeshDevice
 - Add event synchronization APIs to distributed.hpp as per the spec
 - Share command assembly related to event APIs between MeshCQ and HardwareCommandQueue
 - With all core TT-Metal functionality added to TT-Mesh, the MeshCQ no longer relies on the single device HardwareCommandQueue to be available or initialized
 - Remove all bookkeeping done in MeshCQ to maintain shared state with HardwareCommandQueue
 - Add MeshEvent tests
 - Minor fixup for sending go signals to devices not involved in a MeshWorkload when SubDevices are loaded
@tt-asaigal merged commit d54089c into main on Feb 7, 2025
11 checks passed
@tt-asaigal deleted the asaigal/mesh_event branch on February 7, 2025 at 17:55