dma_task in programming examples #1919

Merged
merged 48 commits into from
Nov 22, 2024
b6d7180
Start to port programming examples to use dma task
hunhoffe Nov 13, 2024
49fa9e1
remove unneeded field
hunhoffe Nov 13, 2024
b43a48d
Finish adding alternate (dma task) impls of programming_examples/vision
hunhoffe Nov 13, 2024
fb59e20
Add alt version for ml programming examples
hunhoffe Nov 13, 2024
bbc991f
Add convenience wrappers around dma_*_task functions
hunhoffe Nov 13, 2024
2bfb325
Start porting some of the basic examples to use the dma task structure
hunhoffe Nov 13, 2024
88a033f
Merge branch 'main' into port-examples-dma-task
hunhoffe Nov 13, 2024
774e0e6
Finish rewriting programming examples to use dma task
hunhoffe Nov 13, 2024
f95bf36
Fix for [1, 1, 1, N]
jgmelber Nov 14, 2024
174cdf0
Additional verification linear case patch
jgmelber Nov 14, 2024
bbcde4c
Default sizes to 1
jgmelber Nov 14, 2024
0d86176
Use uint32_t for sizes to match transfer length for dim 0
jgmelber Nov 14, 2024
c7f38ec
Apply suggestions from code review
jgmelber Nov 14, 2024
daa4598
Revert "Default sizes to 1"
jgmelber Nov 14, 2024
259b2a6
Init sizes for vec scalar mul
jgmelber Nov 14, 2024
10b386b
calculate transfer len with less lines of code
hunhoffe Nov 14, 2024
d3ee4e6
Merge branch 'main' into port-examples-dma-task
hunhoffe Nov 14, 2024
efbff0e
Merge branch 'main' into port-examples-dma-task
hunhoffe Nov 14, 2024
ee0fa1a
Remove lingering npu_dma_memcpy_nd from alt examples
hunhoffe Nov 14, 2024
5931e38
Attempt to use repeat count correctly in examples
hunhoffe Nov 15, 2024
22ec2f0
Merge branch 'main' into port-examples-dma-task
hunhoffe Nov 15, 2024
bdc73f0
Does not fix things, but update understanding of repeat count
hunhoffe Nov 15, 2024
2837740
Merge branch 'main' into port-examples-dma-task
hunhoffe Nov 19, 2024
bb39927
Add large linear transfer test for large linear transfer (size used i…
hunhoffe Nov 19, 2024
6bb74fc
Fix minor errors with some alt examples
hunhoffe Nov 19, 2024
d0e4997
Merge branch 'main' into port-examples-dma-task
hunhoffe Nov 19, 2024
84ec074
Reduce diff between normal and alt version
hunhoffe Nov 19, 2024
339592f
Fix a few more typos
hunhoffe Nov 19, 2024
f38cdbe
Merge branch 'main' into port-examples-dma-task
hunhoffe Nov 20, 2024
6873582
Do not check for linear transfer until after setting sizes
hunhoffe Nov 20, 2024
1d09301
Zero out sides/strides for linear transfer
hunhoffe Nov 20, 2024
79b1e47
Merge branch 'main' into port-examples-dma-task
hunhoffe Nov 20, 2024
99fab89
Some prep for larger lens for DMABDOps
hunhoffe Nov 20, 2024
7cb6228
Try fixing vector exp build error
hunhoffe Nov 20, 2024
416af28
matrix vector working locally
hunhoffe Nov 20, 2024
3611676
Revert vector exp change
hunhoffe Nov 20, 2024
6256719
Another attempt to fix vector exp
hunhoffe Nov 20, 2024
384e213
small fix to cascade alt design
hunhoffe Nov 20, 2024
e5b6f10
Small fix, cascade working locally
hunhoffe Nov 20, 2024
d29571a
Start porting examples to use helper function
hunhoffe Nov 20, 2024
dc28e5c
Continue porting examples to use helper function
hunhoffe Nov 21, 2024
cdd28ec
Finish porting basic alt examples to use helper function
hunhoffe Nov 21, 2024
6d38385
Continue fixing up examples
hunhoffe Nov 22, 2024
8e96a57
Finished cleaning up alt examples
hunhoffe Nov 22, 2024
a5eda19
Merge branch 'main' into port-examples-dma-task
hunhoffe Nov 22, 2024
4e11994
Add some documentation to the programming guide regarding DMA task op…
hunhoffe Nov 22, 2024
15649f9
Commit improvements to dma_task section of programming guide
hunhoffe Nov 22, 2024
4727379
Minor formatting fixes in section-2g
hunhoffe Nov 22, 2024
79 changes: 75 additions & 4 deletions programming_guide/section-2/section-2g/README.md
@@ -42,9 +42,9 @@ npu_dma_memcpy_nd(metadata, bd_id, mem, offsets=None, sizes=None, strides=None)
- **`mem`**: Reference to a host buffer, given as an argument to the sequence function, that this transfer will read from or write to.
- **`offsets`** (optional): Start points for data transfer in each dimension. There is a maximum of four offset dimensions.
- **`sizes`**: The extent of data to be transferred across each dimension. There is a maximum of four size dimensions.
- **`strides`** (optional): Interval steps between data points in each dimension, useful for striding-across and reshaping data. There is a maximum of three stride dimensions that can be expressed because dimension 0 is an implicit stride of 1 4B element.
- **`strides`** (optional): Interval steps between data points in each dimension, useful for striding-across and reshaping data.

It is important to note that dimension 0 of the **`sizes`** and all **`strides`** are expressed in a 4B granularity. Higher dimensions of the **`sizes`** are integers to repeat the lower dimensions. The **`offsets`** are expressed in multiples of the **`sizes`**, however the dimension 0 offset is in a 4B granularity. The strides and wraps express data transformations analogously to those described in [Section 2C](../section-2c).
The strides and sizes express data transformations analogously to those described in [Section 2C](../section-2c).
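
As a rough illustration of how `sizes` and `strides` combine into an access pattern, the sketch below enumerates the element offsets that a 4-D pattern would visit. It is a plain-Python model, not part of the `aie` API: the `access_pattern` helper is hypothetical, it assumes offsets counted in whole elements, and it ignores any hardware limits on sizes or strides.

```python
from itertools import product


def access_pattern(sizes, strides, offset=0):
    """Yield the element offsets visited by a sizes/strides pattern.

    Hypothetical helper for illustration only. `sizes` and `strides` are
    listed from the highest dimension (left) down to dimension 0 (right).
    """
    if len(sizes) != len(strides):
        raise ValueError("sizes and strides must have the same length")
    # itertools.product varies the rightmost index (dimension 0) fastest.
    for idx in product(*(range(s) for s in sizes)):
        yield offset + sum(i * st for i, st in zip(idx, strides))


# A linear transfer of 8 elements.
linear = list(access_pattern([1, 1, 1, 8], [0, 0, 0, 1]))

# A 4x4 row-major matrix read as four 2x2 tiles:
# dimensions are (tile row, tile column, row in tile, column in tile).
tiled = list(access_pattern([2, 2, 2, 2], [8, 2, 4, 1]))
```

In the tiled case the first four offsets are `0, 1, 4, 5`, that is, the top-left 2x2 tile of the matrix, followed by the tile to its right.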

**Example Usage**:
```python
@@ -107,7 +107,7 @@ offsets = [0, 0, 0, 0]
npu_dma_memcpy_nd(metadata, bd_id, mem, offsets, sizes, strides)
```

#### **Host Synchronization with `dma_wait`**
#### **Host Synchronization with `dma_wait` after one or more `npu_dma_memcpy_nd` operations**

Synchronization between DMA channels and the host is facilitated by the `dma_wait` operation, ensuring data consistency and proper execution order. The `dma_wait` operation waits until the BD associated with the ObjectFifo is complete, issuing a task complete token.

@@ -130,12 +130,83 @@ Waiting on DMAs associated with more than one object fifo:
dma_wait(of_in, of_out)
```

#### **Best Practices for Data Movement and Synchronization**
#### **Best Practices for Data Movement and Synchronization with `npu_dma_memcpy_nd`**

- **Sync to Reuse Buffer Descriptors**: Each `npu_dma_memcpy_nd` is assigned a `bd_id`. There are a maximum of `16` BDs available to use in each Shim Tile. It is "safe" to reuse BDs once all transfers are complete. This can be managed by first synchronizing on the BDs that must complete to transfer data into the array for a compute operation, and then synchronizing on the BD that receives the data produced by the compute operation and writes it back to host memory.
- **Note Non-blocking Transfers**: Overlap data transfers with computation by leveraging the non-blocking nature of `npu_dma_memcpy_nd`.
- **Minimize Synchronization Overhead**: Synchronize/wait judiciously to avoid excessive overhead that might degrade performance.

#### **Efficient Data Movement with `dma_task` Operations**

As an alternative to `npu_dma_memcpy_nd` and `dma_wait`, there is a series of operations around **DMA tasks** that can serve a similar purpose.

There are two advantages of using the DMA task operations over using `npu_dma_memcpy_nd`:
* The user does not have to specify a BD number
* DMA task operations are capable of *chaining* BD operations; however, this is an advanced use case beyond the scope of this guide.

All programming examples have an `*_alt.py` version that is written using DMA task operations.

**Function Signature and Parameters**:
```python
def shim_dma_single_bd_task(
alloc,
mem,
tensor_tile: TensorTile | None = None,
offset: int | None = None,
sizes: MixedValues | None = None,
strides: MixedValues | None = None,
transfer_len: int | None = None,
issue_token: bool = False,
)
```
- **`alloc`**: A reference to an object FIFO, or the string name of an object FIFO, that records a Shim Tile and one of its DMA channels allocated for the host-side memory transfer. If a string is given, it must match the object FIFO name so that the transfer is associated with that object FIFO.
- **`mem`**: Reference to a host buffer, given as an argument to the sequence function, that this transfer will read from or write to.
- **`tensor_tile`** (optional): An alternative method to `offset`/`sizes`/`strides` for determining an access pattern over the `mem` buffer.
- **`offset`** (optional): Starting point for the data transfer. Default value is `0`.
- **`sizes`**: The extent of data to be transferred across each dimension. There is a maximum of four size dimensions.
- **`strides`** (optional): Interval steps between data points in each dimension, useful for striding-across and reshaping data.
- **`issue_token`** (optional): If a token is issued, one may call `dma_await_task` on the returned task. Default is `False`.

The strides and sizes express data transformations analogously to those described in [Section 2C](../section-2c).

**Example Usage**:
```python
out_task = shim_dma_single_bd_task(of_out, C, sizes=[1, 1, 1, N], issue_token=True)
```

The example above describes a linear transfer of `N` data elements between the `C` buffer in host memory and the object FIFO `of_out`. The `sizes` dimensions are expressed right to left, where the rightmost entry is dimension 0 and the leftmost entry is dimension 3. Higher dimensions that are not used should be set to `1`.
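
To make the right-to-left convention concrete, here is a minimal sketch that pads a shorter `sizes` list with leading `1`s and counts the elements it covers. The `pad_sizes` helper is hypothetical and is not part of the `aie` API; it only models the convention used in the example (`[1, 1, 1, N]`).

```python
from math import prod


def pad_sizes(sizes, rank=4):
    """Left-pad `sizes` with 1s up to `rank` entries (hypothetical helper).

    Dimension 0 is the rightmost entry, so unused higher dimensions are
    added on the left and set to 1.
    """
    if len(sizes) > rank:
        raise ValueError(f"at most {rank} size dimensions are supported")
    return [1] * (rank - len(sizes)) + list(sizes)


N = 4096
linear_sizes = pad_sizes([N])       # a linear transfer of N elements
matrix_sizes = pad_sizes([64, 32])  # 64 rows of 32 elements each

# The element count of a transfer is the product of all size dimensions.
linear_count = prod(linear_sizes)
matrix_count = prod(matrix_sizes)
```

Because the padded dimensions are all `1`, they do not change the element count: the linear case still covers exactly `N` elements.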

#### **Host Synchronization with `dma_await_task`**

Synchronization between DMA channels and the host is facilitated by the `dma_await_task` operation, ensuring data consistency and proper execution order. The `dma_await_task` operation waits until the BD associated with the task is complete, issuing a task complete token.

**Function Signature**:
```python
def dma_await_task(*args: DMAConfigureTaskForOp)
```
- **`args`**: One or more `dma_task` objects, where `dma_task` objects are the value returned by `shim_dma_single_bd_task`.

**Example Usage**:

Waiting on DMAs associated with one object fifo:
```python
# Waits for the output data to transfer from the output object fifo to the host
dma_await_task(out_task)
```

Waiting on DMAs associated with more than one object fifo:
```python
dma_await_task(in_task, out_task)
```

`dma_await_task` can only be called on a task created with `issue_token=True`. If `issue_token=False` (which is the default), then `dma_free_task` should be called when the programmer knows that the task is complete.

#### **Best Practices for Data Movement and Synchronization with `dma_task` Operations**

- **Await or Free to Reuse Buffer Descriptors**: While the exact buffer descriptor (BD) used for each operation is not visible to the user with the `dma_task` operations, there is still a finite number of them (a maximum of `16` on a Shim Tile). Thus, it is important to use `dma_await_task` or `dma_free_task` before the BDs are exhausted so that they may be reused.
- **Note Non-blocking Transfers**: Overlap data transfers with computation by leveraging the non-blocking nature of `dma_start_task`.
- **Minimize Synchronization Overhead**: Synchronize/wait judiciously to avoid excessive overhead that might degrade performance.
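
The bookkeeping behind the first bullet can be sketched with a toy model of a finite BD pool. This is plain Python, not the real runtime: the `BDPool` class and its methods are hypothetical stand-ins that only mimic the rules stated above (a maximum of `16` BDs, awaiting requires `issue_token=True`, and awaiting or freeing a task makes its BD reusable).

```python
class BDPool:
    """Toy model of a Shim Tile's finite BD pool (illustration only)."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.in_use = set()
        self._next_id = 0

    def start_task(self, issue_token=False):
        # Starting a task consumes one BD until the task is awaited or freed.
        if len(self.in_use) >= self.capacity:
            raise RuntimeError("out of BDs: await or free a task first")
        task = (self._next_id, issue_token)
        self._next_id += 1
        self.in_use.add(task)
        return task

    def await_task(self, task):
        # Only tasks created with issue_token=True can be awaited.
        if not task[1]:
            raise ValueError("task was created without issue_token=True")
        self.in_use.remove(task)

    def free_task(self, task):
        # The caller must already know that the task has completed.
        self.in_use.remove(task)


pool = BDPool()
for _ in range(4):
    # Fill the pool: 15 input tasks plus one output task with a token.
    in_tasks = [pool.start_task() for _ in range(15)]
    out_task = pool.start_task(issue_token=True)
    # Waiting on the output implies the inputs it depended on are done,
    # so the input tasks can then be freed without awaiting each one.
    pool.await_task(out_task)
    for task in in_tasks:
        pool.free_task(task)
```

Each iteration uses all 16 BDs and releases them before the next one starts, so the loop can run any number of times without exhausting the pool.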

#### **Conclusion**

The `npu_dma_memcpy_nd` and `dma_wait` functions are powerful tools for managing data transfers and synchronization with AI Engines in the Ryzen™ AI NPU. By understanding and effectively implementing applications leveraging these functions, developers can enhance the performance, efficiency, and accuracy of their high-performance computing applications.