Commit e789c24 — #0: Add subheaders for metal trace/multiple command queues tech report

tt-aho committed Sep 10, 2024 · 1 parent 9b402f7

Showing 1 changed file with 50 additions and 6 deletions.
# Advanced Performance Optimizations for Models

## Contents

1. [Metal Trace](#1-metal-trace)

1.1 [Overview](#11-overview)

1.2 [APIs](#12-apis)

1.3 [Programming Examples](#13-programming-examples)

2. [Multiple Command Queues](#2-multiple-command-queues)

2.1 [Overview](#21-overview)

2.2 [APIs](#22-apis)

2.3 [Programming Examples](#23-programming-examples)

3. [Putting Trace and Multiple Command Queues Together](#3-putting-trace-and-multiple-command-queues-together)

3.1 [Overview](#31-overview)

3.2 [APIs](#32-apis)

3.3 [Programming Examples](#33-programming-examples)

## 1. Metal Trace

### 1.1 Overview

Metal Trace is a performance optimization feature that aims to remove the host overhead of constructing and dispatching operations for models, and will show benefits when the host time to execute/dispatch a model exceeds the device time needed to run the operations.

With trace, we can eliminate a large portion of these gaps, as the figure below shows.
<!-- ![image2](images/image2.png){width=15 height=15} -->
<img src="images/image2.png" style="width:1000px;"/>
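The host-side effect of trace can be pictured with a plain-Python analogy (this is not the ttnn API; `SimpleTrace` and the lambda ops are invented for illustration): the host pays the cost of building each operation once during capture, and replay re-issues the whole captured sequence without per-op host work.

```py
# Illustrative analogy only: a "trace" records a fixed sequence of operations
# once, then replays them without per-operation host dispatch cost.
class SimpleTrace:
    def __init__(self):
        self.ops = []  # captured (function, args) pairs

    def begin_capture(self):
        self.ops = []

    def record(self, fn, *args):
        # During capture, remember the op instead of re-dispatching it later.
        self.ops.append((fn, args))
        return fn(*args)  # ops still execute while being captured

    def replay(self):
        # Replay issues the captured sequence in one shot.
        return [fn(*args) for fn, args in self.ops]

trace = SimpleTrace()
trace.begin_capture()
trace.record(lambda x: x + 1, 41)
trace.record(lambda x: x * 2, 21)
print(trace.replay())  # [42, 42]
```

The real feature plays back recorded dispatch commands on device; the sketch only shows why replay removes the host-side gaps.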

### 1.2 APIs
To use trace, we need the following trace APIs:

* `trace_region_size`
In addition, since trace requires the addresses of the used tensors to be the same across executions, we need to allocate a persistent input tensor on device and copy new data into it each run:

This will copy data from the input host tensor to the allocated on-device tensor.

### 1.3 Programming Examples

Normally we try to allocate tensors in L1 for performance, but many models cannot fit in L1 if the input tensor is kept there. The easiest workaround is to allocate the input in DRAM, where it can stay persistent in memory, and then run an operation to move it to L1. For performance, we'd expect to allocate the input as DRAM sharded and move it to L1 sharded using the reshard operation. A more complex technique that keeps the input in L1 is to allow the input tensor to be deallocated, then reallocate it and ensure it lands at the same address at the end of the trace. Both techniques are demonstrated below.
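The reallocate-at-the-same-address technique can be pictured with a toy allocator (pure-Python analogy; `ToyAllocator` is invented, not a ttnn API): the input is freed so intermediates can reuse L1, then reallocated at the end of the trace, and we assert it came back to the original address.

```py
# Toy bump allocator, only to illustrate "deallocate, then reallocate at the
# same address"; real device allocation is handled by the runtime.
class ToyAllocator:
    def __init__(self, size):
        self.size = size
        self.top = 0  # next free offset

    def allocate(self, nbytes):
        addr = self.top
        self.top += nbytes
        return addr

    def deallocate_last(self, nbytes):
        # Freeing the most recent allocation returns it to the pool.
        self.top -= nbytes

alloc = ToyAllocator(1024)
input_addr = alloc.allocate(256)    # input tensor lives in L1
alloc.deallocate_last(256)          # free it so intermediates can reuse L1
# ... intermediates would be allocated and freed here during the trace ...
realloc_addr = alloc.allocate(256)  # reallocate at the end of the trace
# The captured trace is only valid if the input comes back at the same address.
assert realloc_addr == input_addr
```

In the real flow the assertion compares device buffer addresses of the reallocated tensor against the address baked into the trace.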

Trace only supports capturing and executing operations, and not other commands such as reading/writing tensors. This also means that to capture the trace of a sequence of operations, we must run with program cache and have already compiled the target operations before we capture them.
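The capture discipline above can be sketched in plain Python (all names here are invented stand-ins, not ttnn calls): a warm-up run populates the program cache, and capture then only records ops that are already compiled.

```py
# Sketch of the capture discipline: compile/cache on a warm-up run, then
# capture only already-compiled ops (no compiles or reads/writes mid-capture).
program_cache = {}

def run_op(name):
    if name not in program_cache:
        program_cache[name] = f"compiled::{name}"  # expensive on first run
    return program_cache[name]

ops = ["matmul", "softmax"]

# 1. Warm-up run populates the program cache.
for op in ops:
    run_op(op)

# 2. Capture: every op must already be compiled; capturing a compile (or a
#    tensor read/write) would bake host-side work into the trace.
captured = []
for op in ops:
    assert op in program_cache, "op must be compiled before capture"
    captured.append(op)

print(captured)  # ['matmul', 'softmax']
```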
```py
host_output_tensor = output_tensor.cpu(blocking=False)
```

## 2. Multiple Command Queues

### 2.1 Overview

Metal supports up to two command queues for fast dispatch. These command queues are independent of each other and allow us to dispatch commands to a device in parallel. Both command queues support any dispatch command, so either queue can be used for I/O data transfers or for launching programs. Because the queues are independent, coordinating them and guaranteeing command order, such as having one queue write the input while the other runs operations on that input, requires events to synchronize between the queues. A common setup, and the one described in this document, is to make one command queue responsible only for writing inputs while the other dispatches programs and reads back the output. This is useful when we are device bound and our input tensor takes a long time to write, since it lets us overlap dispatching the next input tensor with the execution of the previous model run. Other setups are also possible, such as one command queue for both writes and reads while the other only dispatches programs, or both command queues running different programs concurrently.
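The event-based ordering between the two queues can be sketched with threads standing in for queues (an analogy only; the writer thread plays CQ 1, the main thread plays CQ 0, and `threading.Event` plays the device events):

```py
import threading

# Analogy for two command queues synchronized with events: CQ 1 only writes
# inputs; CQ 0 waits for each write to finish before running the "model".
write_done = threading.Event()
compute_done = threading.Event()
compute_done.set()           # nothing in flight before the first write
results = []

def cq1_writer(inputs):
    for x in inputs:
        compute_done.wait()  # don't overwrite an input still being consumed
        compute_done.clear()
        results.append(("write", x))
        write_done.set()     # signal CQ 0 that the input is ready

def cq0_compute(n):
    for _ in range(n):
        write_done.wait()    # wait for CQ 1's write to land
        write_done.clear()
        x = results[-1][1]
        results.append(("compute", x * 2))
        compute_done.set()   # the input buffer may be reused now

t = threading.Thread(target=cq1_writer, args=([1, 2, 3],))
t.start()
cq0_compute(3)
t.join()
print([r for r in results if r[0] == "compute"])  # [('compute', 2), ('compute', 4), ('compute', 6)]
```

This single-buffered sketch shows only the ordering guarantees the events provide; on device, the point is that the write queue can run ahead and overlap with execution.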

Using a second command queue only for writes enables us to eliminate the gap between model runs, as the figure below shows.
<!-- ![image4](images/image4.png){width=15 height=15} -->
<img src="images/image4.png" style="width:1000px;"/>

### 2.2 APIs

To use multiple command queues, we need to be familiar with the following APIs:

* `num_hw_cqs`/`num_command_queues` (Currently dependent on whether we are running with single or multi device fixture)
In addition, for our example of using one command queue only for writing inputs, we need to allocate a persistent input tensor on device:

This will copy data from the input host tensor to the allocated on-device tensor.

### 2.3 Programming Examples

Normally we try to allocate tensors in L1 for performance, but many models cannot fit in L1 if the input tensor is kept there. To work around this, we can allocate the input in DRAM, where it stays persistent in memory, then run an operation to move it to L1. For performance, we'd expect to allocate the input as DRAM sharded and move it to L1 sharded using the reshard operation.

To use two command queues where one handles only writes and the other runs programs and reads outputs, we need to create and use two events.
```py
for iter in range(0, 2):
    ...
```
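The two-event handshake used in this section can be laid out as an ordering log (a hedged sketch; the event names and log strings are invented, not ttnn identifiers): CQ 1 records an event after each write, CQ 0 waits on it before running ops, and CQ 0 records an event after consuming the input that CQ 1 waits on before the next write.

```py
# Sketch of the record/wait order with CQ 1 writing inputs and CQ 0 running
# ops and reading back outputs, for two iterations of a model.
log = []

for i in range(2):
    if i > 0:
        log.append(f"cq1 waits op_event_{i-1}")  # input buffer free again?
    log.append(f"cq1 writes input_{i}")
    log.append(f"cq1 records write_event_{i}")

    log.append(f"cq0 waits write_event_{i}")     # input fully written?
    log.append(f"cq0 runs ops on input_{i}")
    log.append(f"cq0 records op_event_{i}")      # input consumed, reusable
    log.append(f"cq0 reads output_{i}")

# The second write cannot start before the first input was consumed:
assert log.index("cq1 writes input_1") > log.index("cq0 records op_event_0")
```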

## 3. Putting Trace and Multiple Command Queues Together

### 3.1 Overview

This section assumes that you are familiar with the contents and APIs described in [Metal Trace](#1-metal-trace) and [Multiple Command Queues](#2-multiple-command-queues).

When combining these two optimizations, there are a few things we need to be aware of:
* Trace cannot capture events, so we do not capture the events or the consumer ops of the input tensor, since we need to enqueue event commands right after them
* Because the input to the trace comes from the output of an operation, and we want that output to be in L1, we need to ensure and assert that the location this output gets written to is a constant address when executing the trace
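The last bullet's address check can be sketched as follows (pure-Python illustration; `first_op` and the constant address are invented stand-ins): remember where the non-captured op wrote the trace input the first time, and assert it never moves before executing the trace.

```py
# Sketch of the constant-address assertion: the op feeding the trace must
# write its L1 output to the same address on every iteration.
trace_input_addr = None

def first_op(alloc_addr):
    # Stand-in for the non-captured consumer op producing the trace input.
    return alloc_addr

for i in range(3):
    addr = first_op(0x4000)      # toy constant address for illustration
    if trace_input_addr is None:
        trace_input_addr = addr  # where the trace expects its input
    # Executing the trace is only safe if the address did not move.
    assert addr == trace_input_addr
```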

### 3.2 APIs

Refer to [1.2 Metal Trace APIs](#12-apis) and [2.2 Multiple Command Queues APIs](#22-apis) for the required APIs.

### 3.3 Programming Examples

The following example shows using two CQs with trace, where we use CQ 1 only for writing inputs, and CQ 0 for running programs and reading outputs.
We use a persistent DRAM tensor to write our input, and we make the input to our trace an L1 tensor that is the output of the first op.

```py
# This example uses 1 CQ for only writing inputs (CQ 1), and one CQ for executing programs/reading back the output (CQ 0)

```
