-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Light Metal capture/replay initial changes to tt-metal for some workloads #16573
base: main
Are you sure you want to change the base?
Conversation
7b7b573
to
a8d6ea2
Compare
e47add8
to
208684d
Compare
Forced push to address 4/8 bullets above. |
a8d6ea2
to
494b285
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Kyle this looks great to me! Few minor nits inline.
07cc235
to
5ce01a8
Compare
208684d
to
209cd54
Compare
5ce01a8
to
5dc94d7
Compare
209cd54
to
f21ce09
Compare
5dc94d7
to
4554497
Compare
I addressed all bullet points I planned on here for initial changes and cleaned up all the FIXMES last few days, and Nick's feedback, rebased on latest tt-metal that we're using in tt-mlir from today. |
tt_metal/distributed/mesh_device.hpp
Outdated
@@ -169,6 +169,11 @@ class MeshDevice : public IDevice, public std::enable_shared_from_this<MeshDevic | |||
uint32_t get_trace_buffers_size() const override; | |||
void set_trace_buffers_size(uint32_t size) override; | |||
|
|||
// Light Metal | |||
void light_metal_begin_capture() override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please consider to return a guard here instead that will end the trace in destructor unless it was ended explicitly by a call to guard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A RAII guard? I considered it after your comment, but it complicates things here especially when LightMetalBeginCapture() and LightMetalEndCapture() are called from different functions and would go out of scope (setup and teardown functions in a test fixture, or added to command-stream from tt-forge-fe at different points in time). Currently my begin_capture() and end_capture() functions mimic how metal-trace begin/end APIs exist, would prefer them to stay ~aligned.
Side note - these APIs will move out of device class after suggestion from @cfjchu in meeting last week. (Edit: Done)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, thank you!
tt_metal/distributed/mesh_device.hpp
Outdated
// Light Metal | ||
void light_metal_begin_capture() override; | ||
std::vector<uint8_t> light_metal_end_capture() override; | ||
void load_trace(const uint8_t cq_id, const uint32_t tid, detail::TraceDescriptor& trace_desc) override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does tid
argument stands for? I assume its trace_id. It is not clear where does one get a value for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are exposing something from detail
namespace here. Usually detail
things stay unexposed. But this has to be exposed to users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep tid
is trace_id, it matches same variable name used in all the other metal-trace related functions above this. This function is called only by "light metal binary replay library" via host_api.hpp LoadTrace()
function of same signature which has documentation, and tid is taken from the light-metal-binary, which was originally captured when the binary was created (ie. a trace_id and TraceDescriptor was captured at capture time, and then loaded to device at replay time here).
About detail::TraceDescriptor
, yeah I don't love that, but I don't see another way to propagate the struct from the binary back into the device through an API. I could move the LoadTrace()
API out of host_api.hpp and into a lightmetal_replay_utils.hpp
file (it feels like an organizational change, not sure if it helps).
tt_metal/impl/tracehost/types.fbs
Outdated
namespace tt.target; | ||
|
||
enum Arch: uint { | ||
Grayskull = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC if the enum value isn't present in the flatbuffer, the C++ API returns the default value, which is the first enum here.
If so, all of enums should have the first value as Invalid
, so that in the code we don't accidentally confuse the absence of value with the default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something briefly discussed with Nick before, tagging him @nsmithtt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@omilyutin-tt one pushback on this is that imo the introduction of a sentinel value in the API communicates that Invalid
has some kind of meaning, but under no circumstances would Invalid
be a legal programming for Arch name. We should assert that arch == the runtime device arch which should handle all potential mis-programming of this value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was just an example, but @omilyutin-tt was referring to all enums. However, in prior chats with @nsmithtt, that we ideally want to not support backward compatability, avoid it's complexities (should help with FuSa) and (not today, but eventually) just ensure capture + replay schema/changelistmatch. Once we have that, could take it a step further, and ensure same flatbuffer and C++ so that we could one day use faster static_cast and avoid all the to_flatbuffer() and from_flatbuffer() conversion functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it was just one example. The goal isn't solely for backwards compatibility, also to ensure integrity of the serialized values (distinguish "unset" fields). I found this guidance for protobufs that elaborates on the point https://protobuf.dev/best-practices/dos-donts/#unspecified-enum. However I think flatbuffers is a bit different here in that it will return a value that might not be known to the old client, instead of the default?
My perspective here is that we should treat these serialized buffers as untrusted, so we should be careful about making assumptions here. Especially since we are feeding them into runtime for execution:)
Regarding backward compatibility. It may not be an issue for lightmetal, however the serialization primitives we are defining here will be more general. I imagine we will use these for communicating with distributed executors, for example, and I'm not sure we can ignore backward compatibility in that world.
Having said that, we aren't at that stage, I agree that we should not overthink these problems. I assume the serialization format we are defining here can be broken, if necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thansk @omilyutin-tt - You bring up some good points here, I think it's worthy of revisiting in seperate issue with discussion, but plan to skip it for purposes of this PR (initial bootstrap changes, to build upon, only used by me for short term).
f21ce09
to
191c41c
Compare
4554497
to
38a3cd8
Compare
Rebased on latest main eef7b86 from this morning (resolved bunch of public API reorg conflicts, moved some of my hpp files to public api folder), and removed light_metal_begin/end_capture() from device class, was not necessary there. And flipped default value of compile time define from build_metal.sh to make feature enabled by default. |
3aefcf3
to
a7c19da
Compare
Fixed few things found from running CI for first time (g++12 warning, code-analysis job, docs include missing, blackhole tests) and squashed into original commits, now it's clean. Need to rebase again (conflicts), and revisit couple more open convos. (Edit: Done, rebased on latest 09349dd, closed a convo above) |
a7c19da
to
0814699
Compare
191c41c
to
4167e29
Compare
…r few host APIs - This new "Light Metal Capture/Replay" feature is being built up still is aiming to provide infra to capture (host+device) workload to binary/disk and replay from binary (tt-metal for now, ttnn eventually). The device side leverages "metal trace" feature and will serialize/deserialize/replay TraceDescriptors. - New Host APIs added to enable 1. LightMetalBeginCapture() - start light metal binary capture 2. LightMetalEndCapture() - end light metal binary capture and return blob to user 3. LoadTrace() - load a TraceDescriptor to device (replay will use this API) - Instrument + support capture/replay few Host APIs: EnqueueTrace(), ReplayTrace(), LoadTrace() - Bulk of the work for this feature so far is Serialization/Deserialization of traced Host API calls to Flatbuffer binary. Flatbuffer schema (.fbs files) defines the format, and ToFlatbuffer(), FromFlatbuffer() helper funcs convert. - Objects like Programs, Kernels, Buffers etc will be assigned unique global_id, and referred to by their global_id in capture + replay. - Capture code is in lightmetal_capture.cpp and replay is lightmetal_replay.hpp - standalone CLI runner lightmetal_runner to run binary from disk - Unrelated - Change trace_buffer.hpp to use fwd decl Buffer instead of buffer.hpp incl to reduce dependencies in lightmetal.cpp - When C++ define TT_ENABLE_LIGHT_METAL_TRACE!=1, host APIs not traced and also LightMetalBeginCapture() / LightMetalEndCapture() are NOPs - Add CMAKE_POSITION_INDEPENDENT_CODE=ON (-fPIC) for flatbuffer to avoid linking errors when lifting to tt-mlir project. - Use -Wno-restrict for flatbuffers compile to supress warning in g++12 build - PRCleanup: Remove light_metal_begin/end_capture() from device class, was not necessary - PRCleanup: Introduce LightMetalBinary type and use it instead of vector<uint8_t> - PRCleanup: Put fbs generated headers in flatbuffers subdir - PRCleanup: Make all FBS schema/structs to be tt::ttmetal::flatbuffer namespace
- Not comprehensive, just initial changes for some of these paths: CreateBuffer() EnqueueWriteBuffer() EnqueueReadBuffer() Finish() DeallocateBuffer ReleaseTrace() CreateProgram() EnqueueProgram() CreateKernel() SetRuntimeArgs(uint32) SetRuntimeArgs(Kernel,RuntimeArgs) CreateCircularBuffer() - When Metal Trace is enabled, don't capture EnqueueProgram(), instead inject ReplayTrace(), would be used alongside LoadTrace() - Serialization / Deserialization of structs/enums defined in flatbuffer schema types.fbs handled via ToFlatbuffer() and FromFlatbuffer() functions. - Workaround in captureDeallocateBuffer() to ignore if CreateBuffer() not used (program binaries, trace buffer) - Fix clang-tidy build fails, use const shared_ptr<>& in Add*ToMap() functions - PRCleanup: Make all FBS schema/structs to be tt::ttmetal::flatbuffer namespace
…lBinary capture+replay e2e testing - Metal trace tests (single op, two op) creates metal traces at capture time, runs at replay time - CI job passes for grayskull, wormhole, blackhole as expected - Use trace_buffer_size=4KB during capture/replay since BH needs it
…e host API tracing - Add TRACE_FUNCTION_ENTRY() at the very start of function that is traced with TRACE_FUNCTION_CALL() to increment scope guard counter. - Really liked single macro usage per trace function, but some APIs like EnqueueProgram() and CreateDevice() (not currently traced, maybe one day) call other host APIs - Can't bundle with existing TRACE_FUNCTION_CALL() macro because sometimes it's called at end of traced function (when it needs to capture the return object) rather than beginning
…et cmake/C++ define TT_ENABLE_LIGHT_METAL_TRACE=OFF/0 - Keep "Light Metal Tracing" feature compile-time define set by default for simpler testing, but have quick option here to disable it if needed from perspective of host API "Capture" functions, and also the Light Metal enable/disable newly added APIS, they all become compile time NOP
…eToGolden() for verif - Put them in lightmetal_capture_utils.hpp since they are purely used at capture time, and not worthy enough to be inside host_api.hpp since just for verif. - Update test_lightmetal_sanity.cpp tests to use these API for functional correctness checking between capture + replay.
…m_flatbuffer.hpp - Move all fbs files into flatbuffer folder (was split between lightmetal, tracehost) - Split types.fbs into somewhat reasonable grouping of base_types.fbs, buffer_types.fbs and program_types.fbs, and do the same for to/from_flatbuffer.hpp files
0814699
to
65138f4
Compare
4167e29
to
a36c0eb
Compare
Ticket
What's changed
This is initial/bootstrapping changes (infra, tracing of ~16 Host APIs on host, device, unit tests e2e) for new feature "Light Metal capture/replay" that can trace host + device portion of workload and replay from standalone binary (tests do this, standalone runner exists too), will be built up on over next few months, and is to position ourselves favorably in automotive sector by having a way to pre-compile AI workload and run on-premise with no dependencies (one day). Parent ticket for that feature for little bit more context is is:
For more detailed overview of changes (and caveat/restrictions) of changes here in this first PR, see below comment link and/or commit descriptions for slightly more detailed changes, or docs (.rst files) added for new APIs I added.
More Details of Features/Changes here in this PR
As mentioned in that link, I left several TODO(kmabee) items in code for things I want to fix next, this isn't comprehensive, and so I marked new APIs with disclaimer
Note: This LightMetal Trace feature is currently under active development and is not fully supported, use at own risk.
Notes / Explicit Callouts
Misc
@kmabeeTT - I am going to update commit descriptions with issue number once approvals come and before merge, to reduce noise in ticket due to rebasing often. Will also sqush a couple more of the recent commits, left them unsquashed to be more visible for review.
@kmabeeTT - Been rebasing often (sometimes painful) past month, currently rebased on Jan22, so I am very recent. Prefer to get this in and address any non-blocking concerns in follow up PR please.
use of detail::TraceDescriptor in host_api LoadTrace():
@ayerofieiev-tt / others - One of my new APIs used only at "replay time" is called
LoadTrace()
and takes adetail::TraceDescriptor
object (to read it from flatbuffer binary and write it to device) which I understand is previously meant to be internal, Artem flagged this in open discussion on this PR already. Not sure what proper fix is we like (please let me know) but nothing is broken right now with this, so would prefer to modify in follow up commit. Couple options may include:Public API header refactoring
I recently rebased on @afuller-TT change, and moved some of my hpp files under
tt_metal/api/tt-metalium/
but in honesty am not sure if I got it 100% right. I left a few inside tree, almost arbitrartily (wasn't sure how to decide whether they should be moved). If nothing major, prefer to address afterwards.Schema backward-compability
@omilyutin-tt - nothing done here for now, will continue to think about it in next steps.
Checklist
No known fails.