Pack Untilize Kernel Perf Optimization #20

Open
rtawfik01 opened this issue Jul 19, 2024 · 1 comment
Labels: Performance (Feature that helps with performance, not a blocker for functionality)

Comments

@rtawfik01
Collaborator

The pack untilize kernel had to be rewritten for Blackhole because of these Blackhole packer limitations:

  1. There is only one Dest offset register per packer instance (but strided mode, with rows 16 apart, can be used).
  2. PACR instructions can only write contiguous output to L1.
  3. The L1 offset registers are now set per PACR context, not per row output by the PACR instruction (i.e. PACK_INTF_SEL):
THCON_SEC0_REG1_L1_Dest_addr_ADDR32 -> Context 0
THCON_SEC0_REG8_L1_Dest_addr_ADDR32 -> Context 1
THCON_SEC1_REG1_L1_Dest_addr_ADDR32 -> Context 2
THCON_SEC1_REG8_L1_Dest_addr_ADDR32 -> Context 3

Here is the current algorithm:
[image: current algorithm]

For Wormhole B0, 8x16 rows can be output per tile (before incrementing counters), while for Blackhole with the above implementation only 2x16 can. Since the pack_untilize feature needs to be fused with other operations, we cannot change the unpacker to use a different unpacking scheme (the fastest untilize scheme would be T0F0, T0F1, T1F0, T1F1, T0F2, T0F3, T1F2, T1F3). However, we should use that implementation when the use case is a pack_untilize of a block of tiles (i.e. not fused).
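As a rough illustration of the non-fused ordering quoted above, the sketch below generates the face sequence for a block of tiles. The helper name `untilize_face_order` is hypothetical, and extending the two-tile sequence to wider blocks (all top faces across the block, then all bottom faces) is an assumption that this issue does not confirm.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of any existing API): lists faces in the
// "fastest untilize" order quoted in the issue for a two-tile block.
// ASSUMPTION: wider blocks follow the same pattern.
std::vector<std::string> untilize_face_order(unsigned num_tiles) {
    std::vector<std::string> order;
    // half 0 = top faces (F0, F1), half 1 = bottom faces (F2, F3)
    for (unsigned half = 0; half < 2; ++half) {
        for (unsigned tile = 0; tile < num_tiles; ++tile) {
            for (unsigned f = 0; f < 2; ++f) {
                order.push_back("T" + std::to_string(tile) +
                                "F" + std::to_string(2 * half + f));
            }
        }
    }
    return order;
}
```

Calling `untilize_face_order(2)` yields the eight labels in the same order as the sequence listed above.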

One major optimization is to enable the use of contexts.

THCON_SEC0_REG1_L1_Dest_addr_ADDR32 = CNTX0 = tile_offset (configured for top faces)
THCON_SEC0_REG8_L1_Dest_addr_ADDR32 = CNTX1 = tile_offset + block_ct_dim * TILE_C_DIM * datum_size (configured for bottom faces)

The algorithm can then do 4x16 rows and would change to something like this:

CH0 x_stride = number of bytes per datum (2 bytes for Float16b default)
CH0 y_stride = FACE_C_DIM*x_stride
CH0 z_stride = FACE_R_DIM*y_stride (1x16x16 datums default, already done in pack_hw_config)
CH0 w_stride = 4*z_stride (32x32 by default)
PACK_INTF_SELECT_0 = 0b0011 (read 2 rows from dest)
PACK_INTF_SELECT_1 = 0b1100 (read 2 rows from dest)
Dest Mode = DST_STRIDED_MODE (Each row read from dest is 16 apart)

    for face_r_dim (16 by default) {
        for block_ct_dim (2 in this example) {
            PACK(CNTX0, PACK_INTF_SELECT_0);  // Reads 2 rows, writes them contiguously in L1; dest rows = 0, 16, 1, 17, 2, 18, ...
            PACK(CNTX1, PACK_INTF_SELECT_1);  // Reads 2 rows, writes them contiguously in L1; dest rows = 32, 48, 33, 49, ...
            W_CNT += 1                        // Jumps to next tile
        }
        Y_CNT += 1
    }
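The stride settings and loop above can be checked with a small host-side model. This is only a sketch: `compute_ch0_strides`, `PackCall`, and `model_pack_calls` are hypothetical names, and the model assumes that each tile occupies 64 consecutive Dest rows (four 16-row faces, F0 to F3) and that faces F0/F1 form the top half of a 32x32 tile. It does not touch hardware registers. It only records which Dest rows each PACK call would read and which row of the untilized block they correspond to.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t FACE_R_DIM = 16;  // rows per face (default from the issue)
constexpr uint32_t FACE_C_DIM = 16;  // columns per face (default from the issue)
constexpr uint32_t TILE_C_DIM = 32;  // columns per tile

struct Ch0Strides { uint32_t x, y, z, w; };  // all values in bytes

// Mirrors the CH0 stride list: x = datum, y = face row, z = face, w = tile.
inline Ch0Strides compute_ch0_strides(uint32_t datum_size_bytes) {
    Ch0Strides s{};
    s.x = datum_size_bytes;
    s.y = FACE_C_DIM * s.x;
    s.z = FACE_R_DIM * s.y;
    s.w = 4 * s.z;
    return s;
}

struct PackCall {
    uint32_t context;     // 0 = top faces, 1 = bottom faces
    uint32_t dest_row_a;  // first Dest row read
    uint32_t dest_row_b;  // second Dest row read, 16 rows later (strided mode)
    uint32_t out_row;     // row of the untilized block being filled
    uint32_t out_col;     // first column of the untilized block being filled
};

// ASSUMPTION: tile t starts at Dest row t * 64.
inline std::vector<PackCall> model_pack_calls(uint32_t block_ct_dim) {
    constexpr uint32_t DEST_ROWS_PER_TILE = 4 * FACE_R_DIM;
    std::vector<PackCall> calls;
    for (uint32_t y = 0; y < FACE_R_DIM; ++y) {             // Y_CNT
        for (uint32_t t = 0; t < block_ct_dim; ++t) {       // W_CNT
            const uint32_t base = t * DEST_ROWS_PER_TILE + y;
            // CNTX0: row y of F0 and F1, i.e. one full row of the top half
            calls.push_back({0, base, base + FACE_R_DIM, y, t * TILE_C_DIM});
            // CNTX1: row y of F2 and F3, i.e. one full row of the bottom half
            calls.push_back({1, base + 2 * FACE_R_DIM, base + 3 * FACE_R_DIM,
                             y + FACE_R_DIM, t * TILE_C_DIM});
        }
    }
    return calls;
}
```

With 2-byte datums the model gives strides of 2, 32, 512, and 2048 bytes. For `block_ct_dim = 2` it produces 64 PACK calls whose Dest rows for tile 0 follow the 0, 16, 1, 17, ... and 32, 48, 33, 49, ... patterns noted in the loop.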

The problem is that if contexts are used, the initial hardware configuration (data formats, tile sizes, etc.) must also be programmed for each context.

@ttmtrajkovic @rdjogoTT fyi

@rtawfik01 rtawfik01 added the Performance Feature that helps with performance, not a blocker for functionality label Jul 19, 2024
@rtawfik01 rtawfik01 changed the title Pack Untilize Kernel Pack Untilize Kernel Perf Optimization Jul 19, 2024
@rtawfik01
Collaborator Author

rtawfik01 commented Aug 14, 2024

Measured perf for pack untilize block test with following args:
block C dim: 4 tiles
block R dim: 1 tile
Num cores: 1

To reproduce:

git checkout rtawfik/pack_untilize_perf
ENABLE_TRACY=1 scripts/build_scripts/build_with_profiler_opt.sh
ninja tests -C build
ENABLE_TRACY=1 TT_METAL_DEVICE_PROFILER=1 TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_metal/unit_tests --gtest_filter="*ComputePackUntilize*"

The Tracy GUI results can be seen here:
[image: Tracy GUI results]
Cycles can be calculated from the machine's AICLK and the GPU execution time.
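A minimal sketch of that conversion, assuming AICLK is given in MHz and the execution time reported by Tracy is in nanoseconds (both units are assumptions, and `cycles_from_time` is a hypothetical helper name):

```cpp
#include <cassert>

// cycles = frequency [Hz] * time [s]
//        = (aiclk_mhz * 1e6) * (exec_time_ns * 1e-9)
//        = aiclk_mhz * exec_time_ns / 1000
inline double cycles_from_time(double aiclk_mhz, double exec_time_ns) {
    return aiclk_mhz * exec_time_ns / 1000.0;
}
```

For example, at a hypothetical AICLK of 1000 MHz, an execution time of 145 ns corresponds to 145 cycles.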

The branch has the blocking calls for circular buffers and math-pack semaphores commented out. The number of cycles that PACK takes is:

WHB0: ~145 cycles (average 10 runs)
BH: ~155 cycles (average 10 runs)

The results consistently show that BH and WHB0 differ by about the same number of cycles for blocks of fewer than 6 tiles. For c_dim > 6, the results are skewed because Wormhole B0 has a shallower instruction buffer (16 insns) than Blackhole (32 insns): at around 6 tiles, math has issued 16 insns and must stall on Wormhole B0. Better results can be obtained from waveforms.

@ttmtrajkovic fyi
