-
I am not sure how to prepare the input. Thanks,
Replies: 6 comments
-
Mapping matrix entries to registers for the MFMA instructions can be highly confusing. Fortunately, Joe Greathouse made a tool that helps tremendously: https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator. For the instruction you are looking at, for instance (showing partial output):
The mapping for matrices B, C and D can be queried similarly.
-
@rwvo Thanks so much for the reply. I was already aware of the AMD matrix instruction calculator. To me the above instruction is similar. Thanks,
-
In the line of code that you quoted, the left shift by 8 * (3-i) bits seems suspicious to me. Wouldn't that put A[0][0] in the higher-order (left-most) byte of a for thread (0,0), while it's supposed to go into the lower-order (right-most) byte? In any case: one of my colleagues wrote a working example for __builtin_amdgcn_mfma_i32_16x16x16i8. It should appear on the blog soon, but I attach it here for your reference. It uses arrays of int8_t values, and then casts to int32_t in the intrinsic call. Attaching as *.txt (mfma_i32_16x16x16i8.txt) because GitHub complains about not supporting *.cpp.
-
Many, many thanks @rwvo. I really appreciate your quick support.
-
@mmoadeli, FYI, this example is now part of the repo.
-
Thanks @gsitaram