Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#0: Hoist SubDeviceManager/Lock-Step Allocator to MeshDevice #16878

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cfjchu
Copy link
Collaborator

@cfjchu cfjchu commented Jan 18, 2025

Ticket

Link to Github Issue

Problem description

As part of the TT-Mesh scope of work, we want to natively virtualize the devices in the mesh. Part of this virtualization process involves deferring to a single set of allocators (global, per-subdevice) at the MeshDevice level, instead of issuing repeated allocations on each of the per-device allocators.

What's changed

Checklist

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

tests/tt_metal/distributed/test_mesh_allocator.cpp Outdated Show resolved Hide resolved
@cfjchu cfjchu force-pushed the jchu/mesh-allocator branch from 92a2dd9 to d91d8e5 Compare January 18, 2025 02:44
Copy link
Contributor

@omilyutin-tt omilyutin-tt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some high level questions for now. In general, I think it is worth exploring a better "allocator" abstraction that is independent of details of mesh / single device / sub device.

tt_metal/distributed/mesh_buffer.cpp Show resolved Hide resolved
tt_metal/distributed/mesh_device.cpp Show resolved Hide resolved
tt_metal/distributed/mesh_device.cpp Show resolved Hide resolved
tt_metal/distributed/mesh_device.cpp Show resolved Hide resolved
tt_metal/impl/sub_device/sub_device_manager.cpp Outdated Show resolved Hide resolved
tt_metal/impl/sub_device/sub_device_manager.cpp Outdated Show resolved Hide resolved
tt_metal/distributed/mesh_device.cpp Outdated Show resolved Hide resolved
tt_metal/distributed/mesh_buffer.cpp Outdated Show resolved Hide resolved
tt_metal/distributed/mesh_device.cpp Show resolved Hide resolved
tt_metal/distributed/mesh_device.cpp Show resolved Hide resolved
tt_metal/distributed/mesh_device.cpp Show resolved Hide resolved
tt_metal/impl/sub_device/sub_device_manager.cpp Outdated Show resolved Hide resolved
@cfjchu cfjchu force-pushed the jchu/mesh-allocator branch 2 times, most recently from 5968a64 to 2f3c743 Compare January 21, 2025 23:59
@cfjchu cfjchu requested a review from davorchap as a code owner January 21, 2025 23:59
@cfjchu cfjchu force-pushed the jchu/mesh-allocator branch 3 times, most recently from 2c4783e to 89a2fd9 Compare January 23, 2025 20:17
@cfjchu cfjchu force-pushed the jchu/mesh-allocator branch from 89a2fd9 to a04ce94 Compare January 23, 2025 21:05
Copy link
Contributor

@tt-aho tt-aho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sub-Device related changes look okay to me. Cleanup of how we want to expose these apis/interfaces instead of having a bunch of wrapping fns is a different discussion/issue from this pr.


// Helper function to verify all devices in the MeshDevice have the same value
template <typename F>
void validate_devices_return_same_value(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the intention, can you make validate_devices_return_same_value validate and return the value? You wouldn't have to repeat the lines like return reference_device()->num_hw_cqs();

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works fine for return values that are by-value but there were some issues when we defer to reference device method that needs to return const ref.

@@ -58,6 +59,9 @@ class SubDeviceManagerTracker {
// default case to not affect performance
SubDeviceManagerId get_default_sub_device_manager_id() const;

std::optional<DeviceAddr> lowest_occupied_compute_l1_address(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's aim to separate allocator from subdevices, and supply it externally instead... This method shouldn't live in the "subdevice manager tracker" abstraction.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's aim to separate allocator from subdevices, and supply it externally instead...

This is an involved change that is outside the scope of this PR. Currently SubDeviceManager fully owns allocator and SubDeviceManagerTracker owns multiple SubDeviceManager objects. But some entity needs to query across allocators. This was previously done in Device and now it's in SubDeviceManagerTracker. This is more appropriate than what living on Device because otherwise we'd duplicate the implementation between Device/MeshDevice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants