
Using tiny tiles with bfp8 format output, matmul gives all 0s in the first row (16 0s) #14305

Open
amahmudTT opened this issue Oct 25, 2024 · 14 comments
Labels: bug (Something isn't working), LLK, P1, WH

@amahmudTT
Contributor

Using tiny tiles with bfp8 format output, matmul gives all 0s in the first row (16 0s).
The dest register has the correct value, implying the problem is in the packing.

To Reproduce

Branch: yugao/tile

Test: pytest tests/ttnn/unit_tests/operations/test_matmul.py::test_matmul_reuse_config_sharded_tiny_tile

It calls the underlying compute kernel:
ttnn/cpp/ttnn/operations/matmul/device/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp

Changing the output dtype to bfloat16 makes the test pass.

amahmudTT added the bug (Something isn't working) and LLK labels on Oct 25, 2024
amahmudTT self-assigned this on Oct 25, 2024
@amahmudTT
Contributor Author

In the MOP config:

        ckernel::ckernel_template tmp(MOP_OUTER_LOOP, MOP_INNER_LOOP, TT_OP_PACR(ADDR_MOD_1, ZERO_OUTPUT_FLAG, PACK_SEL(PACKCNT), 0, MEGAROW, 0, 1));

        if (partial_face && IS_BFP_FORMAT(pack_dst_format)) {
            tmp.set_start_op(TT_OP_PACR(ADDR_MOD_0, ZERO_OUTPUT_FLAG, PACK_SEL(PACKCNT), 0, MEGAROW, 0, 0)); // Don't close the tile, point to the next face
            tmp.set_loop_op0(TT_OP_INCADCXY(p_setadc::PAC, 0, 0, 1, 0)); // Inc ch0_y+=1 (addr_mod_0 will increment by 15)
            tmp.set_loop_op1(TT_OP_PACR(ADDR_MOD_1, ZERO_OUTPUT_FLAG, PACK_SEL(PACKCNT), 0, MEGAROW, 0, 1)); // Close the tile
        }

Changing the condition to

if (partial_face && !narrow_tile && IS_BFP_FORMAT(pack_dst_format)) 

fixes the issue. The comment "Don't close the tile, point to the next face" makes me think this path is meant for a partial face that is wide, i.e. the 16x32 case; we should just use the default code path for our 16x16 tile. I need to test more and learn more about the packer to be sure, but it fixes the issue in the test case. I get a PCC of 0.998, which is still a failure, but that is due to bfp8.
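
As a purely illustrative aside (flag meanings are assumed from this thread only, not verified against the LLK sources: partial_face taken to mean a tile height of 16 and narrow_tile a tile width of 16), a small Python sketch of which output tile shapes would take the special BFP packing path before and after the proposed condition change:

# Hypothetical sketch: partial_face/narrow_tile meanings are assumed
# from this thread (tile height 16 / tile width 16), not taken from
# the LLK sources; is_bfp is true because the output format is bfp8_b.
def takes_special_bfp_path(tile_h, tile_w, with_fix):
    partial_face = tile_h == 16
    narrow_tile = tile_w == 16
    is_bfp = True
    if with_fix:
        return partial_face and not narrow_tile and is_bfp
    return partial_face and is_bfp

for shape in [(16, 16), (16, 32), (32, 16), (32, 32)]:
    print(shape,
          "before fix:", takes_special_bfp_path(*shape, with_fix=False),
          "after fix:", takes_special_bfp_path(*shape, with_fix=True))

With the change, only the wide 16x32 tile keeps the special path; the 16x16 tile falls back to the default packing code, as suggested above.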

@amahmudTT
Contributor Author

Yu is making the PR with the above suggested change, since it does not affect any current test case and it makes sense for it to go in the same PR where the new tests are added.

@jvasilje
Collaborator

jvasilje commented Nov 6, 2024

@amahmudTT is this a P0? A company-level emergency?

jvasilje added the P1 label and removed the P0 label on Nov 6, 2024
@amahmudTT
Contributor Author

amahmudTT commented Nov 6, 2024

@jvasilje The tiny-tiles-related task had a deadline in November, and this bug needed to be sorted out ASAP before the end of October, so I had marked it P0. But we have mostly figured the issue out and the fix is being integrated with Yu's PR, so I think P0 is no longer relevant. Thanks for changing it.

@amahmudTT
Contributor Author

@yugaoTT did you merge your PR? I will close this issue if you did.

@yugaoTT
Contributor

yugaoTT commented Nov 6, 2024

I merged a PR to enable testing of this issue.
I will test with the if (partial_face && !narrow_tile && IS_BFP_FORMAT(pack_dst_format)) change.

@yugaoTT
Contributor

yugaoTT commented Nov 6, 2024

I tried changing the line; it seems 16x16 and 16x32 are failing with bad PCC.
The passing ones are 32x16 and 32x32.

Please try out this test to repro (on main):


@run_for_wormhole_b0()
@pytest.mark.parametrize("m", [768])
@pytest.mark.parametrize("k", [1024])
@pytest.mark.parametrize("n", [768])
@pytest.mark.parametrize("has_bias", [False, True])
@pytest.mark.parametrize("grid_size", [(8, 4)])
@pytest.mark.parametrize("tile_h", [16, 32])
@pytest.mark.parametrize("tile_w", [16, 32])
@pytest.mark.parametrize("in0_sharded", [True])
@pytest.mark.parametrize("out_sharded", [True])
@pytest.mark.parametrize("in1_dtype", [ttnn.bfloat16, ttnn.bfloat8_b])
@pytest.mark.parametrize("transpose_tile", [True, False])
def test_matmul_2d_tiny_tile(
    device, m, k, n, has_bias, grid_size, tile_h, tile_w, in0_sharded, out_sharded, in1_dtype, transpose_tile
):
    in0_shape = [1, 1, m, k]
    in1_shape = [1, 1, k, n]
    bias_shape = [1, 1, n]

    in0_block_w = k // grid_size[0] // 32
    out_block_h = m // grid_size[1] // tile_h
    out_block_w = n // grid_size[0] // tile_w
    out_subblock_h, out_subblock_w, _ = find_max_subblock(out_block_h, out_block_w)

    in0 = torch.ones(in0_shape).bfloat16()
    in1 = torch.randn(in1_shape).bfloat16()

    if in0_sharded:
        in0_memory_config = ttnn.create_sharded_memory_config(
            (1, 1, m, k),
            core_grid=ttnn.CoreGrid(y=grid_size[1], x=grid_size[0]),
            strategy=ttnn.ShardStrategy.BLOCK,
            orientation=ttnn.ShardOrientation.ROW_MAJOR,
        )
    else:
        in0_memory_config = ttnn.L1_MEMORY_CONFIG
    in0_t = ttnn.from_torch(
        in0,
        tile=ttnn.Tile((tile_h, 32)),
        dtype=ttnn.bfloat16,
        layout=ttnn.TILE_LAYOUT,
        device=device,
        memory_config=in0_memory_config,
    )
    in1_t = ttnn.from_torch(
        in1,
        tile=ttnn.Tile((32, tile_w), transpose_tile=transpose_tile),
        dtype=in1_dtype,
        layout=ttnn.TILE_LAYOUT,
        device=device,
        memory_config=ttnn.DRAM_MEMORY_CONFIG,
    )

    if has_bias:
        bias = torch.randn(bias_shape).bfloat16().float()
        bias_padded = bias.unsqueeze(2)
        bias_padded = torch.nn.functional.pad(bias_padded, (0, 0, 0, tile_h - bias_padded.size(2)), "constant", 0)
        bias_t = ttnn.from_torch(
            bias_padded,
            tile=ttnn.Tile((tile_h, tile_w)),
            dtype=ttnn.bfloat16,
            layout=ttnn.TILE_LAYOUT,
            device=device,
            memory_config=ttnn.DRAM_MEMORY_CONFIG,
        )

    program_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
        compute_with_storage_grid_size=grid_size,
        in0_block_w=in0_block_w,
        out_subblock_h=out_subblock_h,
        out_subblock_w=out_subblock_w,
        per_core_M=out_block_h,
        per_core_N=out_block_w,
        transpose_mcast=False,
        fused_activation=None,
    )

    if is_grayskull():
        compute_kernel_config = ttnn.GrayskullComputeKernelConfig(
            math_fidelity=ttnn.MathFidelity.LoFi,
            math_approx_mode=True,
        )
    else:
        compute_kernel_config = ttnn.WormholeComputeKernelConfig(
            math_fidelity=ttnn.MathFidelity.LoFi,
            math_approx_mode=True,
            fp32_dest_acc_en=False,
            packer_l1_acc=True,
        )
    if out_sharded:
        out_mem_config = ttnn.MemoryConfig(
            memory_layout=ttnn.TensorMemoryLayout.BLOCK_SHARDED,
            buffer_type=ttnn.BufferType.L1,
        )
    else:
        out_mem_config = ttnn.L1_MEMORY_CONFIG
    if out_sharded:
        output_tile = ttnn.Tile([tile_h, 32]) if tile_h <= 16 else ttnn.Tile([tile_h, tile_w])
    else:
        output_tile = ttnn.Tile([tile_h, tile_w])
    if has_bias:
        output_t = ttnn.linear(
            in0_t,
            in1_t,
            bias=bias_t,
            program_config=program_config,
            memory_config=out_mem_config,
            dtype=ttnn.bfloat8_b,
            compute_kernel_config=compute_kernel_config,
            output_tile=output_tile,
        )
    else:
        output_t = ttnn.matmul(
            in0_t,
            in1_t,
            program_config=program_config,
            memory_config=out_mem_config,
            dtype=ttnn.bfloat8_b,
            compute_kernel_config=compute_kernel_config,
            output_tile=output_tile,
        )
    output_tensor = ttnn.to_torch(output_t)
    pt_out = in0 @ in1
    if has_bias:
        pt_out += bias

    assert_with_pcc(pt_out, output_tensor, 0.999)
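
Assuming the test above is added to tests/ttnn/unit_tests/operations/test_matmul.py alongside the original repro (that file presumably already provides the torch/pytest/ttnn imports and helpers such as find_max_subblock and assert_with_pcc used here), it could be run with:

pytest tests/ttnn/unit_tests/operations/test_matmul.py::test_matmul_2d_tiny_tile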

@yugaoTT
Contributor

yugaoTT commented Nov 6, 2024

It might be related to partial_face, as the failing ones all have 16 as the tile height dim.

@yugaoTT
Contributor

yugaoTT commented Nov 6, 2024

BTW, I don't think it's P0 either, as most of our models use bfloat16 as output.

@yugaoTT
Contributor

yugaoTT commented Nov 7, 2024

After chatting with @tt-aho, we found the problem is that the dispatcher currently assumes 32B alignment.

With a 16x16 tiny tile in bfloat8_b, each tile is 272B. The dispatcher pads this to 288B per tile. When reading multiple tiles from a core, it reads N*288B, puts the first 272B into the user buffer, skips 16B, then puts the next 272B.
That's why we are missing 16B of data.

Once we have @abhullar-tt's change (16B alignment for L1) in main, this can be solved for 16x16 tiny tiles.
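
For reference, here is a minimal sketch of the size arithmetic above, assuming bfp8_b stores 1 mantissa byte per element plus 1 shared-exponent byte per group of 16 elements, and that the dispatcher pads each tile up to a 32B boundary:

# Minimal sketch of the padding arithmetic described above.
# Assumptions: 1 mantissa byte per element, 1 shared-exponent byte
# per 16 elements, and 32B dispatcher alignment.
def bfp8_tile_bytes(tile_h, tile_w):
    elems = tile_h * tile_w
    return elems + elems // 16          # mantissas + shared exponents

def pad_to(size_bytes, alignment=32):
    return ((size_bytes + alignment - 1) // alignment) * alignment

tile = bfp8_tile_bytes(16, 16)          # 272
padded = pad_to(tile)                   # 288
print(tile, padded, padded - tile)      # 272 288 16 -> the 16B that goes missing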

@abhullar-tt
Contributor

That branch (abhullar/diff-aligns) won't be merged to main until we get BH op parity (#12349). We are focusing on getting BH ops functional without relying on those changes first, because they introduce some other op issues.

@prajaramanTT

@amahmudTT @yugaoTT is this still an open issue? If not, can you please close it? Thanks.

@amahmudTT
Contributor Author

@yugaoTT this is resolved, right?

@yugaoTT
Contributor

yugaoTT commented Jan 20, 2025

I think it still needs #13762 to go in, to test out the impact of the different alignment on the tiny tile output.
