Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Light Metal capture/replay initial changes to tt-metal for some workloads #16573

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

kmabeeTT
Copy link
Contributor

@kmabeeTT kmabeeTT commented Jan 9, 2025

Ticket

[Feature Request] Add Light Metal capture/replay initial changes to tt-metal for some workloads #17039

What's changed

This is initial/bootstrapping changes (infra, tracing of ~16 Host APIs on host, device, unit tests e2e) for new feature "Light Metal capture/replay" that can trace host + device portion of workload and replay from standalone binary (tests do this, standalone runner exists too), will be built up on over next few months, and is to position ourselves favorably in automotive sector by having a way to pre-compile AI workload and run on-premise with no dependencies (one day). Parent ticket for that feature for little bit more context is is:

[Feature Request] Light Metal Feature parent/tracking ticket #17037

For more detailed overview of changes (and caveat/restrictions) of changes here in this first PR, see below comment link and/or commit descriptions for slightly more detailed changes, or docs (.rst files) added for new APIs I added.

More Details of Features/Changes here in this PR

As mentioned in that link, I left several TODO(kmabee) items in code for things I want to fix next, this isn't comprehensive, and so I marked new APIs with disclaimer Note: This LightMetal Trace feature is currently under active development and is not fully supported, use at own risk.

Notes / Explicit Callouts

Misc

@kmabeeTT - I am going to update commit descriptions with issue number once approvals come and before merge, to reduce noise in ticket due to rebasing often. Will also sqush a couple more of the recent commits, left them unsquashed to be more visible for review.

@kmabeeTT - Been rebasing often (sometimes painful) past month, currently rebased on Jan22, so I am very recent. Prefer to get this in and address any non-blocking concerns in follow up PR please.

use of detail::TraceDescriptor in host_api LoadTrace():

@ayerofieiev-tt / others - One of my new APIs used only at "replay time" is called LoadTrace() and takes a detail::TraceDescriptor object (to read it from flatbuffer binary and write it to device) which I understand is previously meant to be internal, Artem flagged this in open discussion on this PR already. Not sure what proper fix is we like (please let me know) but nothing is broken right now with this, so would prefer to modify in follow up commit. Couple options may include:

  • Move TraceDescriptor out of detail and into public API
  • Make it opaque by bouncing it through a void*, or some other API handle type to make it at least slightly more type safe

Public API header refactoring

I recently rebased on @afuller-TT change, and moved some of my hpp files under tt_metal/api/tt-metalium/ but in honesty am not sure if I got it 100% right. I left a few inside tree, almost arbitrartily (wasn't sure how to decide whether they should be moved). If nothing major, prefer to address afterwards.

-rw-r--r-- 1 kmabee  11K Jan 23 05:05 tt_metal/impl/flatbuffer/program_types_to_flatbuffer.hpp
-rw-r--r-- 1 kmabee 5.8K Jan 23 05:05 tt_metal/impl/flatbuffer/program_types_from_flatbuffer.hpp
-rw-r--r-- 1 kmabee 3.0K Jan 23 05:05 tt_metal/impl/flatbuffer/buffer_types_to_flatbuffer.hpp
-rw-r--r-- 1 kmabee 2.5K Jan 23 05:05 tt_metal/impl/flatbuffer/buffer_types_from_flatbuffer.hpp
-rw-r--r-- 1 kmabee 7.9K Jan 23 05:05 tt_metal/impl/flatbuffer/base_types_to_flatbuffer.hpp
-rw-r--r-- 1 kmabee 7.1K Jan 23 05:05 tt_metal/impl/flatbuffer/base_types_from_flatbuffer.hpp

Schema backward-compability

@omilyutin-tt - nothing done here for now, will continue to think about it in next steps.

Checklist

No known fails.

@kmabeeTT kmabeeTT force-pushed the kmabee/lightmetal_capture_replay_wip branch from 7b7b573 to a8d6ea2 Compare January 10, 2025 18:42
@kmabeeTT
Copy link
Contributor Author

Forced push to address 4/8 bullets above.

@kmabeeTT kmabeeTT force-pushed the kmabee/lightmetal_capture_replay_wip branch from a8d6ea2 to 494b285 Compare January 13, 2025 03:11
Copy link
Contributor

@nsmithtt nsmithtt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Kyle this looks great to me! Few minor nits inline.

tt_metal/impl/buffers/circular_buffer_types.hpp Outdated Show resolved Hide resolved
tt_metal/impl/lightmetal/binary.fbs Outdated Show resolved Hide resolved
tt_metal/impl/lightmetal/host_api_capture_helpers.hpp Outdated Show resolved Hide resolved
tt_metal/impl/lightmetal/host_api_capture_helpers.hpp Outdated Show resolved Hide resolved
tt_metal/impl/lightmetal/host_api_capture_helpers.hpp Outdated Show resolved Hide resolved
tt_metal/impl/lightmetal/lightmetal_replay.cpp Outdated Show resolved Hide resolved
@kmabeeTT kmabeeTT force-pushed the kmabee/lightmetal_capture_replay_wip branch 3 times, most recently from 07cc235 to 5ce01a8 Compare January 15, 2025 16:42
@kmabeeTT kmabeeTT force-pushed the kmabee/lightmetal_capture_replay_wip branch from 5ce01a8 to 5dc94d7 Compare January 16, 2025 03:12
@kmabeeTT kmabeeTT force-pushed the kmabee/lightmetal_capture_replay_wip branch from 5dc94d7 to 4554497 Compare January 16, 2025 03:52
@kmabeeTT
Copy link
Contributor Author

I addressed all bullet points I planned on here for initial changes and cleaned up all the FIXMES last few days, and Nick's feedback, rebased on latest tt-metal that we're using in tt-mlir from today.

@@ -169,6 +169,11 @@ class MeshDevice : public IDevice, public std::enable_shared_from_this<MeshDevic
uint32_t get_trace_buffers_size() const override;
void set_trace_buffers_size(uint32_t size) override;

// Light Metal
void light_metal_begin_capture() override;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please consider to return a guard here instead that will end the trace in destructor unless it was ended explicitly by a call to guard?

Copy link
Contributor Author

@kmabeeTT kmabeeTT Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A RAII guard? I considered it after your comment, but it complicates things here especially when LightMetalBeginCapture() and LightMetalEndCapture() are called from different functions and would go out of scope (setup and teardown functions in a test fixture, or added to command-stream from tt-forge-fe at different points in time). Currently my begin_capture() and end_capture() functions mimic how metal-trace begin/end APIs exist, would prefer them to stay ~aligned.

Side note - these APIs will move out of device class after suggestion from @cfjchu in meeting last week. (Edit: Done)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, thank you!

// Light Metal
void light_metal_begin_capture() override;
std::vector<uint8_t> light_metal_end_capture() override;
void load_trace(const uint8_t cq_id, const uint32_t tid, detail::TraceDescriptor& trace_desc) override;
Copy link
Member

@ayerofieiev-tt ayerofieiev-tt Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does tid argument stands for? I assume its trace_id. It is not clear where does one get a value for it.

Copy link
Member

@ayerofieiev-tt ayerofieiev-tt Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are exposing something from detail namespace here. Usually detail things stay unexposed. But this has to be exposed to users.

Copy link
Contributor Author

@kmabeeTT kmabeeTT Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep tid is trace_id, it matches same variable name used in all the other metal-trace related functions above this. This function is called only by "light metal binary replay library" via host_api.hpp LoadTrace() function of same signature which has documentation, and tid is taken from the light-metal-binary, which was originally captured when the binary was created (ie. a trace_id and TraceDescriptor was captured at capture time, and then loaded to device at replay time here).

About detail::TraceDescriptor, yeah I don't love that, but I don't see another way to propagate the struct from the binary back into the device through an API. I could move the LoadTrace() API out of host_api.hpp and into a lightmetal_replay_utils.hpp file (it feels like an organizational change, not sure if it helps).

cmake/flatbuffers.cmake Outdated Show resolved Hide resolved
tt_metal/impl/device/device.hpp Outdated Show resolved Hide resolved
tt_metal/impl/tracehost/command.fbs Outdated Show resolved Hide resolved
namespace tt.target;

enum Arch: uint {
Grayskull = 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC if the enum value isn't present in the flatbuffer, the C++ API returns the default value, which is the first enum here.

If so, all of enums should have the first value as Invalid, so that in the code we don't accidentally confuse the absence of value with the default value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something briefly discussed with Nick before, tagging him @nsmithtt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@omilyutin-tt one pushback on this is that imo the introduction of a sentinel value in the API communicates that Invalid has some kind of meaning, but under no circumstances would Invalid be a legal programming for Arch name. We should assert that arch == the runtime device arch which should handle all potential mis-programming of this value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was just an example, but @omilyutin-tt was referring to all enums. However, in prior chats with @nsmithtt, that we ideally want to not support backward compatability, avoid it's complexities (should help with FuSa) and (not today, but eventually) just ensure capture + replay schema/changelistmatch. Once we have that, could take it a step further, and ensure same flatbuffer and C++ so that we could one day use faster static_cast and avoid all the to_flatbuffer() and from_flatbuffer() conversion functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it was just one example. The goal isn't solely for backwards compatibility, also to ensure integrity of the serialized values (distinguish "unset" fields). I found this guidance for protobufs that elaborates on the point https://protobuf.dev/best-practices/dos-donts/#unspecified-enum. However I think flatbuffers is a bit different here in that it will return a value that might not be known to the old client, instead of the default?

My perspective here is that we should treat these serialized buffers as untrusted, so we should be careful about making assumptions here. Especially since we are feeding them into runtime for execution:)

Regarding backward compatibility. It may not be an issue for lightmetal, however the serialization primitives we are defining here will be more general. I imagine we will use these for communicating with distributed executors, for example, and I'm not sure we can ignore backward compatibility in that world.

Having said that, we aren't at that stage, I agree that we should not overthink these problems. I assume the serialization format we are defining here can be broken, if necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thansk @omilyutin-tt - You bring up some good points here, I think it's worthy of revisiting in seperate issue with discussion, but plan to skip it for purposes of this PR (initial bootstrap changes, to build upon, only used by me for short term).

tt_metal/impl/tracehost/types.fbs Outdated Show resolved Hide resolved
@kmabeeTT kmabeeTT force-pushed the kmabee/lightmetal_capture_replay_wip branch from 4554497 to 38a3cd8 Compare January 20, 2025 05:56
@kmabeeTT
Copy link
Contributor Author

kmabeeTT commented Jan 20, 2025

Rebased on latest main eef7b86 from this morning (resolved bunch of public API reorg conflicts, moved some of my hpp files to public api folder), and removed light_metal_begin/end_capture() from device class, was not necessary there. And flipped default value of compile time define from build_metal.sh to make feature enabled by default.

@kmabeeTT kmabeeTT force-pushed the kmabee/lightmetal_capture_replay_wip branch 3 times, most recently from 3aefcf3 to a7c19da Compare January 22, 2025 18:41
@kmabeeTT
Copy link
Contributor Author

kmabeeTT commented Jan 22, 2025

Fixed few things found from running CI for first time (g++12 warning, code-analysis job, docs include missing, blackhole tests) and squashed into original commits, now it's clean. Need to rebase again (conflicts), and revisit couple more open convos. (Edit: Done, rebased on latest 09349dd, closed a convo above)

@kmabeeTT kmabeeTT force-pushed the kmabee/lightmetal_capture_replay_wip branch from a7c19da to 0814699 Compare January 23, 2025 02:04
…r few host APIs

 - This new "Light Metal Capture/Replay" feature is being built up still is aiming
   to provide infra to capture (host+device) workload to binary/disk and replay
   from binary (tt-metal for now, ttnn eventually).  The device side leverages
   "metal trace" feature and will serialize/deserialize/replay TraceDescriptors.

 - New Host APIs added to enable
   1. LightMetalBeginCapture() - start light metal binary capture
   2. LightMetalEndCapture() - end light metal binary capture and return blob to user
   3. LoadTrace() - load a TraceDescriptor to device (replay will use this API)

 - Instrument + support capture/replay few Host APIs: EnqueueTrace(), ReplayTrace(), LoadTrace()
 - Bulk of the work for this feature so far is Serialization/Deserialization
   of traced Host API calls to Flatbuffer binary.  Flatbuffer schema (.fbs files)
   defines the format, and ToFlatbuffer(), FromFlatbuffer() helper funcs convert.
 - Objects like Programs, Kernels, Buffers etc will be assigned unique
   global_id, and referred to by their global_id in capture + replay.
 - Capture code is in lightmetal_capture.cpp and replay is lightmetal_replay.hpp
 - standalone CLI runner lightmetal_runner to run binary from disk
 - Unrelated - Change trace_buffer.hpp to use fwd decl Buffer instead of buffer.hpp incl
   to reduce dependencies in lightmetal.cpp

 - When C++ define TT_ENABLE_LIGHT_METAL_TRACE!=1, host APIs not traced
   and also LightMetalBeginCapture() / LightMetalEndCapture() are NOPs

 - Add CMAKE_POSITION_INDEPENDENT_CODE=ON (-fPIC) for flatbuffer to avoid
   linking errors when lifting to tt-mlir project.
 - Use -Wno-restrict for flatbuffers compile to supress warning in g++12 build

 - PRCleanup: Remove light_metal_begin/end_capture() from device class, was not necessary
 - PRCleanup: Introduce LightMetalBinary type and use it instead of vector<uint8_t>
 - PRCleanup: Put fbs generated headers in flatbuffers subdir
 - PRCleanup: Make all FBS schema/structs to be tt::ttmetal::flatbuffer namespace
 - Not comprehensive, just initial changes for some of these paths:

   CreateBuffer()
   EnqueueWriteBuffer()
   EnqueueReadBuffer()
   Finish()
   DeallocateBuffer
   ReleaseTrace()
   CreateProgram()
   EnqueueProgram()
   CreateKernel()
   SetRuntimeArgs(uint32)
   SetRuntimeArgs(Kernel,RuntimeArgs)
   CreateCircularBuffer()

 - When Metal Trace is enabled, don't capture EnqueueProgram(), instead
   inject ReplayTrace(), would be used alongside LoadTrace()
 - Serialization / Deserialization of structs/enums defined in
   flatbuffer schema types.fbs handled via ToFlatbuffer() and
   FromFlatbuffer() functions.
 - Workaround in captureDeallocateBuffer() to ignore if CreateBuffer()
   not used (program binaries, trace buffer)
 - Fix clang-tidy build fails, use const shared_ptr<>& in Add*ToMap() functions

 - PRCleanup: Make all FBS schema/structs to be tt::ttmetal::flatbuffer namespace
…lBinary capture+replay e2e testing

 - Metal trace tests (single op, two op) creates metal traces at capture time, runs at replay time
 - CI job passes for grayskull, wormhole, blackhole as expected
 - Use trace_buffer_size=4KB during capture/replay since BH needs it
…e host API tracing

 - Add TRACE_FUNCTION_ENTRY() at the very start of function that is
   traced with TRACE_FUNCTION_CALL() to increment scope guard counter.

 - Really liked single macro usage per trace function, but some APIs
   like EnqueueProgram() and CreateDevice() (not currently traced, maybe
   one day) call other host APIs

 - Can't bundle with existing TRACE_FUNCTION_CALL() macro because
   sometimes it's called at end of traced function (when it needs to
   capture the return object) rather than beginning
…et cmake/C++ define TT_ENABLE_LIGHT_METAL_TRACE=OFF/0

 - Keep "Light Metal Tracing" feature compile-time define set by default
   for simpler testing, but have quick option here to disable it if needed
   from perspective of host API "Capture" functions, and also the Light Metal
   enable/disable newly added APIS, they all become compile time NOP
…eToGolden() for verif

 - Put them in lightmetal_capture_utils.hpp since they are purely used
   at capture time, and not worthy enough to be inside host_api.hpp
   since just for verif.

 - Update test_lightmetal_sanity.cpp tests to use these API for
   functional correctness checking between capture + replay.
…m_flatbuffer.hpp

 - Move all fbs files into flatbuffer folder (was split between lightmetal, tracehost)
 - Split types.fbs into somewhat reasonable grouping of
   base_types.fbs, buffer_types.fbs and program_types.fbs, and do the
   same for to/from_flatbuffer.hpp files
@kmabeeTT kmabeeTT force-pushed the kmabee/lightmetal_capture_replay_wip branch from 0814699 to 65138f4 Compare January 23, 2025 05:03
@kmabeeTT kmabeeTT changed the title LightMetal Capture/Replay bootstrapping changes Add Light Metal capture/replay initial changes to tt-metal for some workloads Jan 23, 2025
@kmabeeTT kmabeeTT self-assigned this Jan 23, 2025
@kmabeeTT kmabeeTT changed the base branch from kmabee/base_branch to main January 23, 2025 20:45
@kmabeeTT kmabeeTT marked this pull request as ready for review January 23, 2025 21:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants