Multi MeshCQ and MeshEvents API Bringup #17582

Merged: 1 commit into main from asaigal/mesh_event, Feb 7, 2025

Conversation

tt-asaigal (Contributor):

Ticket

No Ticket.

Problem description

  • Multi-MeshCQ handling and MeshEvent synchronization APIs need to be added to the TT-Mesh layer and exposed to users through the distributed header.
  • This allows users to parallelize the dispatch of data-movement and control operations to a Virtual Mesh.
  • Additionally, this brings TT-Mesh APIs to parity with core TT-Metal APIs (except for Trace, which is a performance feature), which allows the MeshCommandQueue to function independently of the HardwareCommandQueue.

What's changed

  • Add the MeshEvent class to mesh_event.hpp with associated APIs to distributed.hpp. The implementation for each API is present in mesh_command_queue.cpp.
  • Allow a MeshDevice to be initialized with 2 MeshCommandQueues (see the usage sketch after this list). As is the case with the single-device setup, a MeshDevice must use ethernet dispatch on N300 and T3K systems when exposing multiple command queues.
  • Move command assembly for EnqueueRecordEvent and EnqueueWaitForEvent to a shared header, which allows logic to be reused in the MeshCommandQueue.
  • Completely remove the use of HardwareCommandQueue from MeshCommandQueue, as well as any bookkeeping done to keep both data-structures in sync. The MeshCommandQueue now interfaces directly with the SystemMemoryManager to issue all commands to the Virtual Mesh.
  • Write a custom implementation for MeshCommandQueue::finish() which relies on MeshCommandQueue::drain_events_from_completion_queue(), since the current implementation is entirely single threaded.
  • Add a get_dispatch_core() query API to dispatch_query_manager.
  • Add tests for MeshEvents.
  • Unrelated to MeshEvent: minor modifications for sending go signals to physical devices not involved in a MeshWorkload. This logic now accounts for SubDevices.
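
To make the new surface concrete, here is a minimal usage sketch. It assumes a mesh_device opened with two MeshCommandQueues, a MeshBuffer buf, and host vectors src_vec/dst_vec; the event API argument lists are inferred from the declarations and tests quoted in the review threads below, so treat it as illustrative rather than authoritative.

// Sketch: overlap data movement across the two MeshCQs using MeshEvents.
auto& cq0 = mesh_device->mesh_command_queue(0);   // producer CQ
auto& cq1 = mesh_device->mesh_command_queue(1);   // consumer CQ

auto event = std::make_shared<MeshEvent>();
EnqueueWriteMeshBuffer(cq0, buf, src_vec);        // write on CQ 0
EnqueueRecordEvent(cq0, event);                   // CQ 0 records completion
EnqueueWaitForEvent(cq1, event);                  // CQ 1 stalls until the event fires
ReadShard(cq1, dst_vec, buf, Coordinate(0, 0));   // read shard (0, 0) back on CQ 1
cq0.finish();                                     // host-side barrier per CQ
cq1.finish();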

Resolved review thread: tests/tt_metal/tt_metal/common/multi_device_fixture.hpp (outdated)
@@ -236,7 +236,7 @@ class MeshDevice : public IDevice, public std::enable_shared_from_this<MeshDevice>

 // These methods will get removed once in favour of the ones in IDevice* and TT-Mesh bringup
 // These are prefixed with "mesh_" to avoid conflicts with the IDevice* methods
-MeshCommandQueue& mesh_command_queue();
+MeshCommandQueue& mesh_command_queue(std::size_t cq_id = 0) const;
Contributor:

@ayerofieiev-tt is making a change to make this type strong; I think we should do the same for MeshQueueId upfront?

tt-asaigal (author):

I don't think a strongly typed MeshCommandQueueId object is different from a CommandQueueId object. I'm happy adding something like this (which really should be shared between TT-Mesh and TT-Metal):

class CommandQueueId {
public:
    explicit constexpr CommandQueueId(std::size_t id) : id_(id) {}
    constexpr operator std::size_t() const { return id_; }

    constexpr std::size_t value() const { return id_; }
    constexpr bool operator==(const CommandQueueId& other) const { return id_ == other.id_; }
    constexpr bool operator!=(const CommandQueueId& other) const { return !(*this == other); }

private:
    std::size_t id_;
};

for the purposes of this PR, but I don't want to clobber any of the work Artem is doing. I think it makes sense to consolidate Artem's changes into the MeshCommandQueue once they're on main.

Contributor:

The strongly typed object here is so that we don't interop with single-device CQ, and instead explicitly work with the mesh variant. We don't want the interop, right?

Also, we have a StrongType wrapper, so defining it is just a matter of: using MeshQueueId = tt::stl::StrongType<uint32_t, struct MeshQueueIdTag>;
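
For readers unfamiliar with the pattern, a strong-typedef wrapper is typically along these lines (a generic sketch, not the actual tt::stl::StrongType implementation):

// Generic strong-typedef sketch; the Tag parameter makes each alias a distinct type.
template <typename T, typename Tag>
class StrongType {
public:
    explicit constexpr StrongType(T v) : value_(v) {}
    constexpr const T& operator*() const { return value_; }  // explicit unwrap only
    constexpr bool operator==(const StrongType& other) const { return value_ == other.value_; }

private:
    T value_;
};

using MeshQueueId = StrongType<uint32_t, struct MeshQueueIdTag>;
// MeshQueueId id{1};    // OK: explicit construction
// uint32_t raw = id;    // compile error: no implicit conversion back to uint32_t

Unlike the CommandQueueId draft above, this deliberately omits the implicit conversion operator, so a MeshQueueId cannot silently decay into a plain index accepted by single-device CQ APIs.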

Resolved review thread: tt_metal/api/tt-metalium/mesh_event.hpp (outdated)
#include "mesh_device.hpp"

namespace tt::tt_metal::distributed {
using LogicalDeviceRange = CoreRange;
Contributor:

I think this should live elsewhere... Also can you add a TODO for me to switch this over to a typed DeviceRange? This is the issue #17477

tt-asaigal (author):

Moved to mesh_device_view.hpp, where coordinate systems are currently defined, with a TODO.

Collaborator:

Can we remove this here in favor of the other one?

tt::stl::Span<const SubDeviceId> sub_device_ids = {},
const std::optional<LogicalDeviceRange>& device_range = std::nullopt);

void EnqueueRecordEventToHost(
Contributor:

According to the distributed spec, this notifies all receivers, including host and other devices? Maybe a pair of these would make it clearer:

// Notifies all receivers, including the host, on event completion.
EnqueueRecordEvent(...);

// Notifies all receivers on the device-local CQ.
EnqueueRecordLocalEvent(...);

tt-asaigal (author):

The spec is unclear when it mentions: "EnqueueRecordEventToHost: Have a CQ notify all receivers (including Host) of event completion."

We don't have device-to-device event notifications today: a device either records an event locally or sends it back to host. I was trying to differentiate the two by explicitly informing the user that EnqueueRecordEventToHost will write an event to host, which is a heavier task than recording it locally.
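
For example, the distinction could be captured directly in the header, roughly like this (a sketch: the first two parameters are inferred from the test code in this PR, the trailing ones from the declaration quoted above):

// Records an event on the MeshCQ only. Other MeshCQs can wait on it, but no
// completion-queue entry is written back to host.
void EnqueueRecordEvent(
    MeshCommandQueue& mesh_cq,
    const std::shared_ptr<MeshEvent>& event,
    tt::stl::Span<const SubDeviceId> sub_device_ids = {},
    const std::optional<LogicalDeviceRange>& device_range = std::nullopt);

// Records an event and additionally writes it to the host completion queue so the
// host can synchronize on it. Heavier than recording the event locally.
void EnqueueRecordEventToHost(
    MeshCommandQueue& mesh_cq,
    const std::shared_ptr<MeshEvent>& event,
    tt::stl::Span<const SubDeviceId> sub_device_ids = {},
    const std::optional<LogicalDeviceRange>& device_range = std::nullopt);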

Collaborator:

This is helpful info that would be good to add as comments on the APIs

Collaborator:

Can we document some of the other APIs as well?

tt-asaigal (author):

Yes, I'll document all TT-Mesh APIs, similar to what we do with host_api.hpp, once functional parity is achieved.

Resolved review thread: tt_metal/distributed/mesh_command_queue.cpp (outdated)
Resolved review thread: tt_metal/impl/dispatch/dispatch_query_manager.cpp (outdated)
Resolved review thread: tt_metal/impl/event/dispatch.cpp (outdated)
@tt-asaigal force-pushed the asaigal/mesh_event branch 2 times, most recently from 770d726 to 59c0606 on February 6, 2025 at 22:16
dispatch_core_placement_t& assignment = this->dispatch_core_assignments[device_id][channel][cq_id];
return assignment.dispatcher_d.has_value();
}

const tt_cxy_pair& dispatch_core_manager::dispatcher_d_core(chip_id_t device_id, uint16_t channel, uint8_t cq_id) {
Contributor:

Just to clarify: this method checks if the core is allocated and, if not, allocates it? The API looks as if it is just a getter.

Contributor:

Perhaps two methods would be cleaner: one that returns an optional<tt_cxy_pair>, and a second one that explicitly allocates the core.

tt-asaigal (author):

I agree, we need explicit behaviour here. For the accessor/modifier I added to this PR, I'm following the convention used for all other queries. I think we should have separate work for cleaning up these APIs in general.
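
A minimal sketch of the suggested split, with hypothetical names that are not part of this PR:

// Pure query: returns the dispatcher_d core if one has been assigned; never allocates.
std::optional<tt_cxy_pair> dispatcher_d_core_if_assigned(
    chip_id_t device_id, uint16_t channel, uint8_t cq_id) const;

// Explicit modifier: assigns a dispatcher_d core if none exists, then returns it.
const tt_cxy_pair& assign_dispatcher_d_core(
    chip_id_t device_id, uint16_t channel, uint8_t cq_id);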

#include "mesh_device.hpp"

namespace tt::tt_metal::distributed {
using LogicalDeviceRange = CoreRange;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can remove this here in favor of the other one?

Comment on lines +49 to +56
for (std::size_t logical_x = 0; logical_x < buf->device()->num_cols(); logical_x++) {
for (std::size_t logical_y = 0; logical_y < buf->device()->num_rows(); logical_y++) {
readback_vecs.push_back({});
auto shard = buf->get_device_buffer(Coordinate(logical_y, logical_x));
ReadShard(
mesh_device_->mesh_command_queue(1), readback_vecs.back(), buf, Coordinate(logical_y, logical_x));
}
}
Collaborator:

@omilyutin-tt does it make sense to add some logic to EnqueueReadMeshBuffer for the replicated path so we can clean up some of this scaffolding?

    TT_FATAL(
        buffer->global_layout() == MeshBufferLayout::SHARDED, "Can only read a Sharded MeshBuffer from a MeshDevice.");

Contributor:

I think "replicated" and "sharded" should be property of the write API, not the buffer itself. Is it possible to mutate the data on each shard after the fact (so you replicate initial data, mutate it, then read back individual shards)? Let's chat on this separately, I think we can come up with a much cleaner model for this.

Collaborator:

Yeah, agreed. We can make this cleaner and it'll help our own testing.
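
Until then, a hypothetical helper along these lines could replace the per-shard scaffolding quoted above (names are illustrative; it wraps the same ReadShard loop):

// Reads back every shard of a MeshBuffer, one host vector per logical device.
std::vector<std::vector<uint32_t>> read_all_shards(
    MeshCommandQueue& cq, const std::shared_ptr<MeshBuffer>& buf) {
    std::vector<std::vector<uint32_t>> readback_vecs;
    for (std::size_t logical_x = 0; logical_x < buf->device()->num_cols(); logical_x++) {
        for (std::size_t logical_y = 0; logical_y < buf->device()->num_rows(); logical_y++) {
            readback_vecs.push_back({});
            ReadShard(cq, readback_vecs.back(), buf, Coordinate(logical_y, logical_x));
        }
    }
    return readback_vecs;
}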

std::vector<std::vector<uint32_t>> readback_vecs = {};
std::shared_ptr<MeshEvent> event = std::make_shared<MeshEvent>();
// Writes on CQ 0
EnqueueWriteMeshBuffer(mesh_device_->mesh_command_queue(0), buf, src_vec);
Collaborator:

@omilyutin-tt for sharding, we have a way of specifying a subset of devices. We don't have similar expressiveness for replication. Is there something we can add for our metal testing?

Contributor:

Ack, let's chat on this offline



Resolved review thread: tests/tt_metal/distributed/test_mesh_events.cpp (outdated)
@omilyutin-tt (Contributor) left a review:

Some minor comments left, thanks!

Resolved review thread: tt_metal/impl/dispatch/dispatch_query_manager.cpp (outdated)
Resolved review thread: tt_metal/impl/event/dispatch.cpp (outdated)
Resolved review thread: tt_metal/impl/event/dispatch.cpp (outdated)
Resolved review thread: tt_metal/impl/event/dispatch.cpp



Resolved review threads: tt_metal/distributed/mesh_command_queue.cpp (3 threads)
Commit message:
 - Natively support Host <-> MeshCQ and MeshCQ <-> MeshCQ synchronization in TT-Mesh
 - Enable users to access up to 2 MeshCQs through MeshDevice
 - Add event synchronization APIs to distributed.hpp as per the spec
 - Share command assembly related to event APIs between MeshCQ and HardwareCommandQueue
 - With all core TT-Metal functionality added to TT-Mesh, the MeshCQ no longer relies on the single device HardwareCommandQueue to be available or initialized
 - Remove all bookkeeping done in MeshCQ to maintain shared state with HardwareCommandQueue
 - Add MeshEvent tests
 - Minor fixup for sending go signals to devices not involved in a MeshWorkload when SubDevices are loaded
@tt-asaigal merged commit d54089c into main on Feb 7, 2025
11 checks passed
@tt-asaigal deleted the asaigal/mesh_event branch on February 7, 2025 at 17:55