
Using tiny tiles with bfp8 format output, matmul gives all 0s in the first row (16 0s) #14305

Open
amahmudTT opened this issue Oct 25, 2024 · 14 comments
Labels: bug (Something isn't working), LLK, P1, WH

@amahmudTT
Contributor

Using tiny tiles with bfp8 format output, matmul gives all 0s in the first row (16 0s).
The dest register has the correct value, implying the problem is in the packing.

To Reproduce

Branch: yugao/tile

Test: pytest tests/ttnn/unit_tests/operations/test_matmul.py::test_matmul_reuse_config_sharded_tiny_tile

It calls the underlying compute kernel:
ttnn/cpp/ttnn/operations/matmul/device/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp

Changing the output dtype to bfloat16 makes the test pass.

amahmudTT added the bug (Something isn't working) and LLK labels on Oct 25, 2024
amahmudTT self-assigned this on Oct 25, 2024
@amahmudTT
Contributor Author

In the MOP config:

        ckernel::ckernel_template tmp(MOP_OUTER_LOOP, MOP_INNER_LOOP, TT_OP_PACR(ADDR_MOD_1, ZERO_OUTPUT_FLAG, PACK_SEL(PACKCNT), 0, MEGAROW, 0, 1));

        if (partial_face && IS_BFP_FORMAT(pack_dst_format)) {
            tmp.set_start_op(TT_OP_PACR(ADDR_MOD_0, ZERO_OUTPUT_FLAG, PACK_SEL(PACKCNT), 0, MEGAROW, 0, 0)); // Don't close the tile, point to the next face
            tmp.set_loop_op0(TT_OP_INCADCXY(p_setadc::PAC, 0, 0, 1, 0)); // Inc ch0_y+=1 (addr_mod_0 will increment by 15)
            tmp.set_loop_op1(TT_OP_PACR(ADDR_MOD_1, ZERO_OUTPUT_FLAG, PACK_SEL(PACKCNT), 0, MEGAROW, 0, 1)); // Close the tile
        }

Changing the condition to

if (partial_face && !narrow_tile && IS_BFP_FORMAT(pack_dst_format)) 

fixes the issue. The comment "Don't close the tile, point to the next face" makes me think this path is meant for a partial face that is wide, i.e. the 16x32 case; we should just use the default code path for our 16x16 tile. I need to test more and learn more about the packer to be sure, but it fixes the issue in the test case. I get a PCC of 0.998, which is still a failure, but that is due to bfp8.
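
As a purely illustrative aside (flag meanings are assumed from this thread only, not verified against the LLK sources: partial_face taken to mean a tile height of 16 and narrow_tile a tile width of 16), a small Python sketch of which output tile shapes would take the special BFP packing path before and after the proposed condition change:

# Hypothetical sketch: partial_face/narrow_tile meanings are assumed
# from this thread (tile height 16 / tile width 16), not taken from
# the LLK sources; is_bfp is true because the output format is bfp8_b.
def takes_special_bfp_path(tile_h, tile_w, with_fix):
    partial_face = tile_h == 16
    narrow_tile = tile_w == 16
    is_bfp = True
    if with_fix:
        return partial_face and not narrow_tile and is_bfp
    return partial_face and is_bfp

for shape in [(16, 16), (16, 32), (32, 16), (32, 32)]:
    print(shape,
          "before fix:", takes_special_bfp_path(*shape, with_fix=False),
          "after fix:", takes_special_bfp_path(*shape, with_fix=True))

With the change, only the wide 16x32 tile keeps the special path; the 16x16 tile falls back to the default packing code, as suggested above.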

@amahmudTT
Contributor Author

Yu is making the PR with the above suggested change, since it does not affect any current test case and it makes sense for it to go in the same PR where the new tests are added.

@jvasilje
Collaborator

jvasilje commented Nov 6, 2024

@amahmudTT is this a P0? A company-level emergency?

jvasilje added the P1 label and removed the P0 label on Nov 6, 2024
@amahmudTT
Contributor Author

amahmudTT commented Nov 6, 2024

@jvasilje The tiny-tiles-related task had a deadline in November, and this bug needed to be sorted out ASAP before the end of October, so I had marked it P0. But we have mostly figured the issue out and the fix is being integrated with Yu's PR, so I think P0 is no longer relevant. Thanks for changing it.

@amahmudTT
Contributor Author

@yugaoTT did you merge your PR? I will close this issue if you did.

@yugaoTT
Contributor

yugaoTT commented Nov 6, 2024

I merged a PR to enable testing of this issue.
I will test with the if (partial_face && !narrow_tile && IS_BFP_FORMAT(pack_dst_format)) change.

@yugaoTT
Contributor

yugaoTT commented Nov 6, 2024

I tried changing the line; it seems 16x16 and 16x32 are failing with bad PCC.
The passing ones are 32x16 and 32x32.

Please try out this test to repro (on main):


@run_for_wormhole_b0()
@pytest.mark.parametrize("m", [768])
@pytest.mark.parametrize("k", [1024])
@pytest.mark.parametrize("n", [768])
@pytest.mark.parametrize("has_bias", [False, True])
@pytest.mark.parametrize("grid_size", [(8, 4)])
@pytest.mark.parametrize("tile_h", [16, 32])
@pytest.mark.parametrize("tile_w", [16, 32])
@pytest.mark.parametrize("in0_sharded", [True])
@pytest.mark.parametrize("out_sharded", [True])
@pytest.mark.parametrize("in1_dtype", [ttnn.bfloat16, ttnn.bfloat8_b])
@pytest.mark.parametrize("transpose_tile", [True, False])
def test_matmul_2d_tiny_tile(
    device, m, k, n, has_bias, grid_size, tile_h, tile_w, in0_sharded, out_sharded, in1_dtype, transpose_tile
):
    in0_shape = [1, 1, m, k]
    in1_shape = [1, 1, k, n]
    bias_shape = [1, 1, n]

    in0_block_w = k // grid_size[0] // 32
    out_block_h = m // grid_size[1] // tile_h
    out_block_w = n // grid_size[0] // tile_w
    out_subblock_h, out_subblock_w, _ = find_max_subblock(out_block_h, out_block_w)

    in0 = torch.ones(in0_shape).bfloat16()
    in1 = torch.randn(in1_shape).bfloat16()

    if in0_sharded:
        in0_memory_config = ttnn.create_sharded_memory_config(
            (1, 1, m, k),
            core_grid=ttnn.CoreGrid(y=grid_size[1], x=grid_size[0]),
            strategy=ttnn.ShardStrategy.BLOCK,
            orientation=ttnn.ShardOrientation.ROW_MAJOR,
        )
    else:
        in0_memory_config = ttnn.L1_MEMORY_CONFIG
    in0_t = ttnn.from_torch(
        in0,
        tile=ttnn.Tile((tile_h, 32)),
        dtype=ttnn.bfloat16,
        layout=ttnn.TILE_LAYOUT,
        device=device,
        memory_config=in0_memory_config,
    )
    in1_t = ttnn.from_torch(
        in1,
        tile=ttnn.Tile((32, tile_w), transpose_tile=transpose_tile),
        dtype=in1_dtype,
        layout=ttnn.TILE_LAYOUT,
        device=device,
        memory_config=ttnn.DRAM_MEMORY_CONFIG,
    )

    if has_bias:
        bias = torch.randn(bias_shape).bfloat16().float()
        bias_padded = bias.unsqueeze(2)
        bias_padded = torch.nn.functional.pad(bias_padded, (0, 0, 0, tile_h - bias_padded.size(2)), "constant", 0)
        bias_t = ttnn.from_torch(
            bias_padded,
            tile=ttnn.Tile((tile_h, tile_w)),
            dtype=ttnn.bfloat16,
            layout=ttnn.TILE_LAYOUT,
            device=device,
            memory_config=ttnn.DRAM_MEMORY_CONFIG,
        )

    program_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
        compute_with_storage_grid_size=grid_size,
        in0_block_w=in0_block_w,
        out_subblock_h=out_subblock_h,
        out_subblock_w=out_subblock_w,
        per_core_M=out_block_h,
        per_core_N=out_block_w,
        transpose_mcast=False,
        fused_activation=None,
    )

    if is_grayskull():
        compute_kernel_config = ttnn.GrayskullComputeKernelConfig(
            math_fidelity=ttnn.MathFidelity.LoFi,
            math_approx_mode=True,
        )
    else:
        compute_kernel_config = ttnn.WormholeComputeKernelConfig(
            math_fidelity=ttnn.MathFidelity.LoFi,
            math_approx_mode=True,
            fp32_dest_acc_en=False,
            packer_l1_acc=True,
        )
    if out_sharded:
        out_mem_config = ttnn.MemoryConfig(
            memory_layout=ttnn.TensorMemoryLayout.BLOCK_SHARDED,
            buffer_type=ttnn.BufferType.L1,
        )
    else:
        out_mem_config = ttnn.L1_MEMORY_CONFIG
    if out_sharded:
        output_tile = ttnn.Tile([tile_h, 32]) if tile_h <= 16 else ttnn.Tile([tile_h, tile_w])
    else:
        output_tile = ttnn.Tile([tile_h, tile_w])
    if has_bias:
        output_t = ttnn.linear(
            in0_t,
            in1_t,
            bias=bias_t,
            program_config=program_config,
            memory_config=out_mem_config,
            dtype=ttnn.bfloat8_b,
            compute_kernel_config=compute_kernel_config,
            output_tile=output_tile,
        )
    else:
        output_t = ttnn.matmul(
            in0_t,
            in1_t,
            program_config=program_config,
            memory_config=out_mem_config,
            dtype=ttnn.bfloat8_b,
            compute_kernel_config=compute_kernel_config,
            output_tile=output_tile,
        )
    output_tensor = ttnn.to_torch(output_t)
    pt_out = in0 @ in1
    if has_bias:
        pt_out += bias

    assert_with_pcc(pt_out, output_tensor, 0.999)
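
Assuming the test above is added to tests/ttnn/unit_tests/operations/test_matmul.py alongside the original repro (that file presumably already provides the torch/pytest/ttnn imports and helpers such as find_max_subblock and assert_with_pcc used here), it could be run with:

pytest tests/ttnn/unit_tests/operations/test_matmul.py::test_matmul_2d_tiny_tile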

@yugaoTT
Contributor

yugaoTT commented Nov 6, 2024

It might be related to partial_face, as the failing ones all have 16 as the tile height dim.

@yugaoTT
Contributor

yugaoTT commented Nov 6, 2024

BTW, I don't think it's P0 either, as most of our models use bfloat16 as output.

@yugaoTT
Contributor

yugaoTT commented Nov 7, 2024

After chatting with @tt-aho, we found the problem is that the dispatcher currently assumes 32B alignment.

With a 16x16 tiny tile in bfloat8_b, each tile is 272B. The dispatcher pads this to 288B per tile. When reading multiple tiles from a core, it reads N*288B, puts the first 272B into the user buffer, skips 16B, then puts the next 272B.
That's why we are missing 16B of data.

Once we have @abhullar-tt's change (16B alignment for L1) in main, this can be solved for 16x16 tiny tiles.
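
For reference, here is a minimal sketch of the size arithmetic above, assuming bfp8_b stores 1 mantissa byte per element plus 1 shared-exponent byte per group of 16 elements, and that the dispatcher pads each tile up to a 32B boundary:

# Minimal sketch of the padding arithmetic described above.
# Assumptions: 1 mantissa byte per element, 1 shared-exponent byte
# per 16 elements, and 32B dispatcher alignment.
def bfp8_tile_bytes(tile_h, tile_w):
    elems = tile_h * tile_w
    return elems + elems // 16          # mantissas + shared exponents

def pad_to(size_bytes, alignment=32):
    return ((size_bytes + alignment - 1) // alignment) * alignment

tile = bfp8_tile_bytes(16, 16)          # 272
padded = pad_to(tile)                   # 288
print(tile, padded, padded - tile)      # 272 288 16 -> the 16B that goes missing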

@abhullar-tt
Contributor

That branch (abhullar/diff-aligns) won't be merged to main until we get BH op parity (#12349). We are focusing on getting BH ops functional without relying on those changes first, because they introduce some other op issues.

@prajaramanTT

@amahmudTT @yugaoTT is this still an open issue? If not, can you please close it? Thanks.

@amahmudTT
Contributor Author

@yugaoTT this is resolved, right?

@yugaoTT
Contributor

yugaoTT commented Jan 20, 2025

I think it still needs #13762 to go in, to test out the impact of the different alignment on the tiny tile output.
