Multi MeshCQ and MeshEvents API Bringup #17582
Conversation
Force-pushed 231f95a to 5dff352
@@ -236,7 +236,7 @@ class MeshDevice : public IDevice, public std::enable_shared_from_this<MeshDevic

    // These methods will get removed once in favour of the ones in IDevice* and TT-Mesh bringup
    // These are prefixed with "mesh_" to avoid conflicts with the IDevice* methods
-   MeshCommandQueue& mesh_command_queue();
+   MeshCommandQueue& mesh_command_queue(std::size_t cq_id = 0) const;
@ayerofieiev-tt is making a change to make this type strong. I think we should do the same for MeshQueueId upfront?
I don't think a strongly typed MeshCommandQueueId object is different from a CommandQueueId object. I'm happy adding something like this (which really should be shared between TT-Mesh and TT-Metal):

    class CommandQueueId {
    public:
        explicit constexpr CommandQueueId(std::size_t id) : id_(id) {}
        constexpr operator std::size_t() const { return id_; }
        constexpr std::size_t value() const { return id_; }
        constexpr bool operator==(const CommandQueueId& other) const { return id_ == other.id_; }
        constexpr bool operator!=(const CommandQueueId& other) const { return !(*this == other); }
    private:
        std::size_t id_;
    };

for the purposes of this PR, but I don't want to clobber any of the work Artem is doing. I think it makes sense to consolidate Artem's changes into the MeshCommandQueue once they're on main.
The strongly typed object here is so that we don't interop with the single-device CQ, and instead explicitly work with the mesh variant. We don't want the interop, right?
Also, we have a wrapper for StrongType, so defining one is just a matter of: using MeshQueueId = tt::stl::StrongType<uint32_t, struct MeshQueueIdTag>;
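For illustration, a minimal sketch of what the StrongType-based alias buys us. This is not the actual tt::stl::StrongType implementation, and the mesh_command_queue signature shown in the comment is hypothetical:

    #include <cstdint>

    // Simplified stand-in for tt::stl::StrongType; the real wrapper in the repo
    // may expose a different interface.
    template <typename T, typename Tag>
    class StrongType {
    public:
        explicit constexpr StrongType(T value) : value_(value) {}
        constexpr T get() const { return value_; }
        constexpr bool operator==(const StrongType& other) const { return value_ == other.value_; }
        constexpr bool operator!=(const StrongType& other) const { return !(*this == other); }
    private:
        T value_;
    };

    // A distinct type for mesh CQ ids: it cannot be passed where a plain
    // single-device cq_id (std::size_t / uint8_t) is expected, so accidental
    // interop with the single-device CQ APIs fails to compile.
    using MeshQueueId = StrongType<uint32_t, struct MeshQueueIdTag>;

    // Hypothetical accessor signature using the strong type:
    // MeshCommandQueue& mesh_command_queue(MeshQueueId cq_id = MeshQueueId{0});

    int main() {
        MeshQueueId cq_id{1};
        return cq_id.get() == 1 ? 0 : 1;
    }

The point of the strong type is exactly the compile-time separation discussed above: a MeshQueueId and a single-device CQ id stop being interchangeable integers.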
#include "mesh_device.hpp"

namespace tt::tt_metal::distributed {
using LogicalDeviceRange = CoreRange;
I think this should live elsewhere... Also, can you add a TODO for me to switch this over to a typed DeviceRange? This is the issue: #17477
Moved to mesh_device_view.hpp, where coordinate systems are currently defined, with a TODO.
Can we remove this here in favor of the other one?
    tt::stl::Span<const SubDeviceId> sub_device_ids = {},
    const std::optional<LogicalDeviceRange>& device_range = std::nullopt);

void EnqueueRecordEventToHost(
According to the distributed spec, this notifies all receivers, including host and other devices? Maybe a pair of these will make it clearer:

    // Notifies all receivers, including the host, on event completion.
    EnqueueRecordEvent(...);
    // Notifies all receivers on the device-local CQ.
    EnqueueRecordLocalEvent
The spec is unclear when it mentions this: "EnqueueRecordEventToHost: Have a CQ notify all receivers (including Host) of event completion." We don't have device-to-device event notifications today - a device either records an event locally or sends it back to host. I was trying to differentiate the two by explicitly informing the user that EnqueueRecordEventToHost will write an event to host, which is a heavier task than recording it locally.
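Based on that distinction, comments along these lines could be attached to the declarations. This is only a sketch - the actual parameter lists in distributed.hpp may differ:

    #include <memory>

    class MeshCommandQueue;
    class MeshEvent;

    // Records an event on the issuing MeshCQ only. Other CQs on the same
    // device can wait on it, but no completion data is written back to host.
    void EnqueueRecordEvent(MeshCommandQueue& mesh_cq, const std::shared_ptr<MeshEvent>& event);

    // Records an event and additionally writes it back to host, so the host
    // can synchronize on its completion. Heavier than EnqueueRecordEvent
    // because it requires a device-to-host write.
    void EnqueueRecordEventToHost(MeshCommandQueue& mesh_cq, const std::shared_ptr<MeshEvent>& event);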
This is helpful info that would be good to add as comments on the APIs.
Can we document some of the other APIs as well?
Yes, I'll document all TT-Mesh APIs, similar to what we do with host_api.hpp, once functional parity is achieved.
Force-pushed 770d726 to 59c0606
    dispatch_core_placement_t& assignment = this->dispatch_core_assignments[device_id][channel][cq_id];
    return assignment.dispatcher_d.has_value();
}

const tt_cxy_pair& dispatch_core_manager::dispatcher_d_core(chip_id_t device_id, uint16_t channel, uint8_t cq_id) {
Just to clarify - this method checks if the core is allocated and, if not, allocates it? The API looks as if it is just a getter.
Perhaps two methods would be cleaner: one that returns an optional<tt_cxy_pair>, and a second that explicitly allocates the core.
I agree, we need explicit behaviour here. For the accessor/modifier I added to this PR, I'm following the convention used for all other queries. I think we should have separate work for cleaning up these APIs in general.
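For reference, a rough sketch of the two-method split being suggested. The method names and the allocation helper are hypothetical, and the real dispatch_core_manager members may differ (the sketch assumes map-like storage and that dispatcher_d is a std::optional, as the diff above implies):

    // Pure query: returns the dispatcher_d core if it has already been
    // assigned, and std::nullopt otherwise. Never mutates state.
    std::optional<tt_cxy_pair> dispatch_core_manager::get_dispatcher_d_core(
        chip_id_t device_id, uint16_t channel, uint8_t cq_id) const {
        const auto& assignment = this->dispatch_core_assignments.at(device_id).at(channel).at(cq_id);
        return assignment.dispatcher_d;
    }

    // Explicit allocator: assigns a dispatcher_d core if none exists yet and
    // returns it. next_available_dispatch_core() stands in for the existing
    // allocation logic.
    const tt_cxy_pair& dispatch_core_manager::allocate_dispatcher_d_core(
        chip_id_t device_id, uint16_t channel, uint8_t cq_id) {
        auto& assignment = this->dispatch_core_assignments[device_id][channel][cq_id];
        if (!assignment.dispatcher_d.has_value()) {
            assignment.dispatcher_d = this->next_available_dispatch_core(device_id);
        }
        return assignment.dispatcher_d.value();
    }

With this split, callers that only want to inspect state use the const query, and any mutation is an explicit, visible call.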
for (std::size_t logical_x = 0; logical_x < buf->device()->num_cols(); logical_x++) {
    for (std::size_t logical_y = 0; logical_y < buf->device()->num_rows(); logical_y++) {
        readback_vecs.push_back({});
        auto shard = buf->get_device_buffer(Coordinate(logical_y, logical_x));
        ReadShard(
            mesh_device_->mesh_command_queue(1), readback_vecs.back(), buf, Coordinate(logical_y, logical_x));
    }
}
@omilyutin-tt does it make sense to add some logic to EnqueueReadMeshBuffer for the replicated path, so we can clean up some of this scaffolding?
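Roughly, a hypothetical overload that reads back every device's copy of a replicated buffer could replace the per-coordinate loop in the tests. The name, signature, and parameter types below are illustrative only, mirroring the existing test code:

    // Illustrative only: a possible EnqueueReadMeshBuffer overload for
    // replicated buffers that reads one copy per device.
    void EnqueueReadMeshBuffer(
        MeshCommandQueue& mesh_cq,
        std::vector<std::vector<uint32_t>>& dst_per_device,
        const std::shared_ptr<MeshBuffer>& buffer) {
        for (std::size_t logical_x = 0; logical_x < buffer->device()->num_cols(); logical_x++) {
            for (std::size_t logical_y = 0; logical_y < buffer->device()->num_rows(); logical_y++) {
                dst_per_device.push_back({});
                ReadShard(mesh_cq, dst_per_device.back(), buffer, Coordinate(logical_y, logical_x));
            }
        }
    }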
TT_FATAL(
    buffer->global_layout() == MeshBufferLayout::SHARDED, "Can only read a Sharded MeshBuffer from a MeshDevice.");
I think "replicated" and "sharded" should be property of the write API, not the buffer itself. Is it possible to mutate the data on each shard after the fact (so you replicate initial data, mutate it, then read back individual shards)? Let's chat on this separately, I think we can come up with a much cleaner model for this.
Yeah, agreed - we can make this cleaner and it'll help our own testing.
std::vector<std::vector<uint32_t>> readback_vecs = {};
std::shared_ptr<MeshEvent> event = std::make_shared<MeshEvent>();
// Writes on CQ 0
EnqueueWriteMeshBuffer(mesh_device_->mesh_command_queue(0), buf, src_vec);
@omilyutin-tt for sharding, we have a way of specifying a subset of devices. We don't have similar expressiveness for replication. Is there something we can add for our metal testing?
Ack, let's chat about this offline.
Some minor comments left, thanks!
Force-pushed 59c0606 to 6b6b417
- Natively support Host <-> MeshCQ and MeshCQ <-> MeshCQ synchronization in TT-Mesh
- Enable users to access up to 2 MeshCQs through MeshDevice
- Add event synchronization APIs to distributed.hpp as per the spec
- Share command assembly related to event APIs between MeshCQ and HardwareCommandQueue
- With all core TT-Metal functionality added to TT-Mesh, the MeshCQ no longer relies on the single device HardwareCommandQueue to be available or initialized
- Remove all bookkeeping done in MeshCQ to maintain shared state with HardwareCommandQueue
- Add MeshEvent tests
- Minor fixup for sending go signals to devices not involved in a MeshWorkload when SubDevices are loaded
Force-pushed 6b6b417 to a995150
Ticket
No Ticket.
Problem description
- MeshEvent synchronization APIs need to be added to the TT-Mesh layer and exposed to users through the distributed header.
- The MeshCommandQueue needs to function independently of the HardwareCommandQueue.

What's changed
- Added the MeshEvent class to mesh_event.hpp, with associated APIs in distributed.hpp. The implementation for each API is present in mesh_command_queue.cpp.
- Allow a MeshDevice to be initialized with 2 MeshCommandQueues. As is the case with the single device setup, a MeshDevice must use ethernet dispatch on N300 and T3K systems when exposing multiple command queues.
- Moved the command assembly for EnqueueRecordEvent and EnqueueWaitForEvent to a shared header, which allows the logic to be reused in the MeshCommandQueue.
- Removed the dependency on HardwareCommandQueue from MeshCommandQueue, as well as any bookkeeping done to keep both data-structures in sync. The MeshCommandQueue now interfaces directly with the SystemMemoryManager to issue all commands to the Virtual Mesh.
- MeshCommandQueue::finish() now relies on MeshCommandQueue::drain_events_from_completion_queue(), since the current implementation is entirely single threaded.
- Added a get_dispatch_core() query to dispatch_query_manager.
- Added tests for MeshEvents.
- MeshEvent: minor modifications for sending go signals to physical devices not involved in a MeshWorkload. This now accounts for SubDevices.
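A short usage sketch of the resulting flow, based on the test snippets in this PR. The exact signatures of the event APIs are assumptions, and mesh_device, buf, src_vec, and readback_vec are presumed to be set up as in the tests:

    // Write data on MeshCQ 0.
    EnqueueWriteMeshBuffer(mesh_device->mesh_command_queue(0), buf, src_vec);

    // MeshCQ <-> MeshCQ sync: record an event on CQ 0 after the write and
    // have CQ 1 wait on it before reading back (assumed signatures).
    auto write_event = std::make_shared<MeshEvent>();
    EnqueueRecordEvent(mesh_device->mesh_command_queue(0), write_event);
    EnqueueWaitForEvent(mesh_device->mesh_command_queue(1), write_event);
    ReadShard(mesh_device->mesh_command_queue(1), readback_vec, buf, Coordinate(0, 0));

    // Host <-> MeshCQ sync: record an event that is written back to host,
    // then drain CQ 1 so the host knows all prior work has completed.
    auto read_event = std::make_shared<MeshEvent>();
    EnqueueRecordEventToHost(mesh_device->mesh_command_queue(1), read_event);
    mesh_device->mesh_command_queue(1).finish();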
Checklist