For Wormhole B0, 8x16 rows can be output per tile (before incrementing counters), while for Blackhole with the above implementation, only 2x16 can. Since the pack_untilize feature needs to be fused with other operations, we cannot change the unpacker to use a different unpacking scheme (the fastest untilize scheme would be T0F0, T0F1, T1F0, T1F1, T0F2, T0F3, T1F2, T1F3). However, we should use that implementation when the use case is a pack_untilize of a block of tiles (i.e. not fused).
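To make the face order quoted above concrete, here is a small Python sketch that generates it for a row of tiles. The function name is hypothetical, and extending the pattern beyond the two-tile example in the text is an assumption (top faces of every tile first, then bottom faces).

```python
def fastest_untilize_face_order(block_ct_dim):
    """Face read order for an unfused untilize of one row of tiles.

    F0/F1 are the top faces of a 32x32 tile, F2/F3 the bottom faces.
    Assumption: the two-tile order quoted in the text generalizes to
    any block_ct_dim by sweeping all tiles for each face pair.
    """
    order = []
    for face_pair in ((0, 1), (2, 3)):        # top faces first, then bottom faces
        for tile in range(block_ct_dim):      # sweep across the block of tiles
            for face in face_pair:
                order.append(f"T{tile}F{face}")
    return order


print(", ".join(fastest_untilize_face_order(2)))
# T0F0, T0F1, T1F0, T1F1, T0F2, T0F3, T1F2, T1F3
```

Reading faces in this order means each pair of consecutive faces already forms one contiguous 32-datum span of an output row, which is why the scheme suits the unfused case.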
One major optimization is to enable the use of contexts.
THCON_SEC0_REG1_L1_Dest_addr_ADDR32 = CNTX0 = tile_offset (Configured for top faces)
THCON_SEC0_REG8_L1_Dest_addr_ADDR32 = CNTX2 = tile_offset + block_ct_dim * TILE_C_DIM * datum_size (Configured for bottom faces)
The algorithm can then do 4x16 rows per tile, and would change to something like this:
CH0 x_stride = number of bytes per datum (2 bytes for Float16b default)
CH0 y_stride = FACE_C_DIM * x_stride
CH0 z_stride = FACE_R_DIM * y_stride (1x16x16 datums default, already done in pack_hw_config)
CH0 w_stride = 4 * z_stride (32x32 by default)
PACK_INTF_SELECT_0 = 0b0011 (read 2 rows from dest)
PACK_INTF_SELECT_1 = 0b1100 (read 2 rows from dest)
Dest Mode = DST_STRIDED_MODE (Each row read from dest is 16 apart)
for face_r_dim (16 by default) {
    for block_ct_dim (2 in this example) {
        PACK(CNTX0, PACK_INTF_SELECT_0); (Reads 2 rows, writes them contiguously in L1; dest rows = 0, 16, 1, 17, 2, 18, ...)
        PACK(CNTX1, PACK_INTF_SELECT_1); (Reads 2 rows, writes them contiguously in L1; dest rows = 32, 48, 33, 49, ...)
        W_CNT += 1 (Jumps to next tile)
    }
    Y_CNT += 1
}
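The loop above can be sanity-checked with a small software model. The Python sketch below is not LLK code: it works with row and column indices rather than byte addresses, and it assumes dest holds each tile as four 16x16 faces in F0..F3 order, with tiles 64 rows apart. All function names are hypothetical.

```python
FACE_R_DIM = 16      # rows per face
FACE_C_DIM = 16      # datums per face row
FACES_PER_TILE = 4


def channel0_strides(datum_size=2):
    """CH0 strides in bytes, following the formulas listed above."""
    x_stride = datum_size
    y_stride = FACE_C_DIM * x_stride
    z_stride = FACE_R_DIM * y_stride
    w_stride = FACES_PER_TILE * z_stride
    return x_stride, y_stride, z_stride, w_stride


def tilize(matrix, block_ct_dim):
    """Build a model of dest: a list of 16-datum rows, tile by tile, face by face."""
    dest = []
    for tile in range(block_ct_dim):
        for face in range(FACES_PER_TILE):
            r0 = (face // 2) * FACE_R_DIM
            c0 = tile * 2 * FACE_C_DIM + (face % 2) * FACE_C_DIM
            for r in range(FACE_R_DIM):
                dest.append(matrix[r0 + r][c0:c0 + FACE_C_DIM])
    return dest


def simulate_pack_untilize(dest, block_ct_dim):
    """Model of the two-context loop: each PACK reads two dest rows 16 apart."""
    width = block_ct_dim * 2 * FACE_C_DIM
    out = [[None] * width for _ in range(2 * FACE_R_DIM)]
    for y_cnt in range(FACE_R_DIM):
        for w_cnt in range(block_ct_dim):
            base = w_cnt * FACES_PER_TILE * FACE_R_DIM + y_cnt
            col = w_cnt * 2 * FACE_C_DIM
            # First context: top faces, dest rows base and base + 16.
            out[y_cnt][col:col + 32] = dest[base] + dest[base + 16]
            # Second context: bottom faces, dest rows base + 32 and base + 48.
            out[FACE_R_DIM + y_cnt][col:col + 32] = dest[base + 32] + dest[base + 48]
    return out


# A 32x64 row-major matrix (two tiles side by side) with unique values.
matrix = [[r * 64 + c for c in range(64)] for r in range(32)]
assert simulate_pack_untilize(tilize(matrix, 2), 2) == matrix
```

In this model the top-face context fills output rows 0 to 15 and the bottom-face context fills rows 16 to 31, so one pass of the outer loop produces four 16-datum dest rows per tile.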
The problem here is that if contexts are used, the initial hardware configuration (data formats, tile sizes, etc.) also needs to be programmed for every context in use.
The results consistently show that BH and WHB0 have around the same cycle-count difference for blocks of fewer than 6 tiles. For c_dim > 6, the results are affected by Wormhole B0's shallower instruction buffer (16 instructions, compared to 32 on Blackhole): at around 6 tiles, math has issued 16 instructions and needs to stall on Wormhole B0. Better results can be obtained from waveforms.
The pack untilize kernel needed to be rewritten for Blackhole due to Blackhole packer limitations:
Here is the current algorithm:
@ttmtrajkovic @rdjogoTT fyi