-
I am not sure how to prepare the input. Thanks,
Replies: 6 comments
-
Mapping matrix entries to registers for the MFMA instructions can be highly confusing. Fortunately, Joe Greathouse made a tool that helps tremendously: https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator. For the instruction you are looking at, for instance (showing partial output):
The mapping for matrices B, C and D can be queried similarly.
-
@rwvo Thanks so much for the reply. I was already aware of the AMD matrix instruction calculator. To me the above instruction is similar. Thanks,
-
In the line of code that you quoted, the left shift by 8 * (3-i) bits seems suspicious to me. Wouldn't that put A[0][0] in the higher-order (left-most) byte of a for thread (0,0), while it's supposed to go into the lower-order (right-most) byte? In any case: one of my colleagues wrote a working example for __builtin_amdgcn_mfma_i32_16x16x16i8. It should appear on the blog soon, but I attach it here for your reference. It uses arrays of int8_t values, and then casts to int32_t in the intrinsic call. Attaching as *.txt (mfma_i32_16x16x16i8.txt) because GitHub complains about not supporting *.cpp.
-
Many, many thanks @rwvo. I really appreciate your quick support.
-
@mmoadeli, FYI, this example is now part of the repo.
-
Thanks @gsitaram