Using tiny tiles with bfp8 format output, the matmul gives a first row of all 0s (16 0s) #14305
In the mop config, changing the condition fixes the issue. The comment "dont close the tile, point to the next face" makes me think that path is meant for a partial_face, but that is the wide one, the 16x32 tile; we should just use the default code for our 16x16 tile. I need to test more and learn more about the packer to be sure, but it fixes the issue in the test case. I get a PCC of 0.998, which is a failure, but that is due to bfp8.
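For context, the PCC check compares the device output against a golden reference using the Pearson correlation coefficient. A minimal sketch of such a check in plain PyTorch; the tensors and the 0.999-style threshold are illustrative assumptions, not taken from the actual test:

```python
import torch

def compute_pcc(expected: torch.Tensor, actual: torch.Tensor) -> float:
    # Pearson correlation coefficient over the flattened tensors.
    stacked = torch.stack([expected.flatten().float(), actual.flatten().float()])
    return torch.corrcoef(stacked)[0, 1].item()

# Illustrative only: simulate the kind of small quantization noise a lossy
# format such as bfp8 introduces.
expected = torch.randn(16, 32)
actual = expected + 0.05 * torch.randn_like(expected)
print(compute_pcc(expected, actual))  # close to 1.0, but can fall below a
                                      # 0.999-style pass threshold
```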
Yu is making the PR with the above suggested change, as this change does not affect any current test case and it makes sense for it to be in the same PR where the new tests are added.
@amahmudTT is this a P0? A company-level emergency?
@jvasilje The tiny tiles task had a deadline in November, and this bug needed to be sorted out ASAP, before the end of October, so I had marked it P0. But we have mostly figured the issue out and the fix is being integrated in Yu's PR, so I think P0 is no longer relevant. Thanks for changing it.
@yugaoTT did you merge your PR? I will close this issue if you did.
I merged a PR to enable testing of this issue.
I tried changing the line; it seems 16x16 and 16x32 are failing with bad PCC. Please try out this test to repro (on main).
It might be related to partial_face, as the failing ones all have 16 as the tile height dim.
BTW, I don't think it's P0 either, as most of our models use bfloat16 as the output.
After chatting with @tt-aho, we found the problem is that the dispatcher currently assumes 32B alignment. With a 16x16 tiny tile in bfloat8_b, each tile is 272B, which the dispatcher pads to 288B per tile. When reading multiple tiles from a core, it reads N*288B, puts the first 272B into the user buffer, skips 16B, then puts the next 272B. Once we have @abhullar-tt's change (16B alignment for L1) in main, this will be solved for 16x16 tiny tiles.
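To make the arithmetic above concrete, here is a minimal sketch of the size calculation; the bfp8_b layout assumption (one mantissa byte per element plus one shared exponent byte per 16 elements) is inferred from the 272B figure above, not taken from the dispatcher code:

```python
def bfp8_b_tile_bytes(height: int, width: int) -> int:
    # Assumed layout: 1 mantissa byte per element + 1 shared exponent byte
    # per 16-element group (inferred from the 272B figure above).
    mantissa_bytes = height * width
    exponent_bytes = (height * width) // 16
    return mantissa_bytes + exponent_bytes

def pad_to(size: int, alignment: int) -> int:
    # Round size up to the next multiple of the alignment.
    return (size + alignment - 1) // alignment * alignment

tile = bfp8_b_tile_bytes(16, 16)  # 256 + 16 = 272B
print(pad_to(tile, 32))           # 288B under the 32B-aligned dispatcher,
                                  # i.e. 16B skipped after every tile
print(pad_to(tile, 16))           # 272B once 16B L1 alignment lands
```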
That branch ( |
@amahmudTT @yugaoTT is this still an open issue? If not, can you please close it? Thanks.
@yugaoTT this is resolved, right?
I think it still needs #13762 |
Using tiny tiles with bfp8 format output, the matmul gives a first row of all 0s (16 0s)
The dest register has the correct value, implying there is a problem with the packing.
To Reproduce
Branch: yugao/tile
Test: pytest tests/ttnn/unit_tests/operations/test_matmul.py::test_matmul_reuse_config_sharded_tiny_tile
It will call the underlying compute kernel:
ttnn/cpp/ttnn/operations/matmul/device/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp
Changing the output dtype to bfloat16 makes the test pass.
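As a rough illustration of that workaround, a minimal sketch of a matmul with the output dtype switched to bfloat16; the shapes, device setup, and use of ttnn.matmul's dtype argument are assumptions for illustration, not the actual test (which also uses a sharded program config and 16x16 tiny-tile shapes, elided here):

```python
import torch
import ttnn

# Hypothetical minimal sketch, not the actual test: the real test is
# test_matmul_reuse_config_sharded_tiny_tile with a sharded program config
# and tiny-tile (16x16) tile shapes, both elided here.
device = ttnn.open_device(device_id=0)

torch_a = torch.randn(1, 1, 16, 32, dtype=torch.bfloat16)
torch_b = torch.randn(1, 1, 32, 32, dtype=torch.bfloat16)

a = ttnn.from_torch(torch_a, layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch_b, layout=ttnn.TILE_LAYOUT, device=device)

# With dtype=ttnn.bfloat8_b the first output row comes back as zeros;
# requesting bfloat16 output sidesteps the packing/alignment bug.
out = ttnn.matmul(a, b, dtype=ttnn.bfloat16)

result = ttnn.to_torch(out)
ttnn.close_device(device)
```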