
Optimize non-transposed int8 GEMV kernel #559

Merged: robertknight merged 2 commits into main from faster-gemv-int8-kernel on Jan 29, 2025
Conversation

robertknight (Owner) commented on Jan 29, 2025

Previously the non-transposed int8 GEMV kernel iterated over the B matrix in blocks of I32_VEC_LEN columns by 4 rows. At each step it loaded I32_VEC_LEN int8 values and sign-extended them to i32. These were interleaved to give a 4x4 transposed tile of B, and two dot product instructions were used to update I32_VEC_LEN dot products and column sums.

This version reduces the number of loads from B by loading 4 rows of I8_VEC_LEN columns at a time, interleaving them to give I32_VEC_LEN 4x4 transposed tiles, and using I32_VEC_LEN * 2 dot product instructions to update I8_VEC_LEN dot products and column sums.
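As a rough illustration of the new layout, here is a scalar model of the interleave for I8_VEC_LEN = 16 and I32_VEC_LEN = 4 (not the kernel's actual SIMD code; the function name and array shapes are illustrative only):

```rust
/// Scalar model: take 4 rows x 16 int8 columns of B and produce 4
/// transposed 4x4 tiles. Each 4-byte group packs one column's 4
/// consecutive-row values, which is the operand layout a 4-way int8
/// dot product instruction (e.g. NEON `sdot`) expects.
fn interleave_b_rows(rows: [[i8; 16]; 4]) -> [[[i8; 4]; 4]; 4] {
    let mut tiles = [[[0i8; 4]; 4]; 4];
    for col in 0..16 {
        for row in 0..4 {
            // Tile col / 4, lane col % 4, byte `row` holds B[row][col].
            tiles[col / 4][col % 4][row] = rows[row][col];
        }
    }
    tiles
}
```

With this layout, one dot product instruction per tile lane can accumulate a[k..k+4] · B[k..k+4, col] into each column's i32 accumulator.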

On an M3 Pro (5 performance cores) this improves GEMV performance for the non-transposed case from 45 GFLOPS to 61 GFLOPS, compared with 90 GFLOPS when B is transposed. For wasmtime (single core) it goes from 6.8 to 16.6 GFLOPS, compared with 28 GFLOPS when transposed.

A downside of this change is that the maximum number of tail columns is now much larger (up to 15, 31, or 63 columns, the last on AVX-512). This could possibly be handled by using masked loads for the tail.
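One possible shape for the masked-load idea, sketched in scalar form (the helper name is hypothetical): copy the valid tail columns into a zeroed buffer so the full-width vector path can run unchanged, with the zero padding contributing nothing to the dot products or column sums.

```rust
/// Hypothetical helper: emulate a masked load of up to LEN int8 values
/// by zero-padding, so a partial tail of columns can reuse the
/// full-width kernel path.
fn masked_load_i8<const LEN: usize>(src: &[i8]) -> [i8; LEN] {
    let mut buf = [0i8; LEN];
    let n = src.len().min(LEN);
    buf[..n].copy_from_slice(&src[..n]);
    buf
}
```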

  • Add SimdInt::{zip_lo_i8, zip_hi_i8, zip_lo_i16, zip_hi_i16} methods for interleaving packed 8- and 16-bit ints (see the sketch after this list)
  • Use these methods to implement a more efficient GEMV as described above
  • Fix the x86 build for macOS, which was broken in an earlier commit (and is not tested in CI :( )
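A scalar model of the zip semantics assumed here for the i8 variants, following NEON's `zip1`/`zip2` (the i16 variants are analogous; the 16-byte vector width is illustrative):

```rust
/// Interleave the low halves of two 16-byte vectors: [a0, b0, a1, b1, ...].
fn zip_lo_i8(a: [i8; 16], b: [i8; 16]) -> [i8; 16] {
    let mut out = [0i8; 16];
    for i in 0..8 {
        out[2 * i] = a[i];
        out[2 * i + 1] = b[i];
    }
    out
}

/// Interleave the high halves: [a8, b8, a9, b9, ...].
fn zip_hi_i8(a: [i8; 16], b: [i8; 16]) -> [i8; 16] {
    let mut out = [0i8; 16];
    for i in 0..8 {
        out[2 * i] = a[8 + i];
        out[2 * i + 1] = b[8 + i];
    }
    out
}
```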

TODO:

  • Fix the issue with zip on x64 AVX2, where the underlying unpack operates on 128-bit blocks instead of the whole 256-bit vector (illustrated after this list)
  • Test performance impact on non-Arm platforms
  • Fix the same x64 zip-method issue for AVX-512
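The AVX2 wrinkle is that `_mm256_unpacklo_epi8` and friends interleave within each 128-bit lane independently rather than across the full 256-bit vector. A scalar model of that behaviour:

```rust
/// Scalar model of AVX2 `_mm256_unpacklo_epi8`: the low 8 bytes of each
/// 128-bit lane are interleaved separately, so the result is not a
/// single 256-bit-wide zip of the two inputs.
fn avx2_unpacklo_epi8(a: [i8; 32], b: [i8; 32]) -> [i8; 32] {
    let mut out = [0i8; 32];
    for lane in 0..2 {
        let base = 16 * lane;
        for i in 0..8 {
            out[base + 2 * i] = a[base + i];
            out[base + 2 * i + 1] = b[base + i];
        }
    }
    out
}
```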

robertknight force-pushed the faster-gemv-int8-kernel branch 5 times, most recently from 01edc6b to 523330a, on January 29, 2025 at 20:53
Instead of loading 4 rows of 4 elements from B at a time, load 4 rows of 16
elements and interleave to give 4 x 4x4 transposed tiles.

The inner loop over K was also changed to manually increment `k` instead of
using `range_chunks_exact` as this generated slightly better code.

On an M3 Pro this improved the `bench_gemm_mix` gemv benchmark for int8
from ~45 GFLOPS to ~60 GFLOPS for the non-transposed case.
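
A minimal sketch of the inner-loop shape described in the commit message, assuming `depth` is the K dimension (names here are illustrative, and `range_chunks_exact` is a helper in this repo whose exact signature is not shown):

```rust
fn inner_loop(depth: usize) {
    // Step `k` manually in blocks of 4 rows instead of iterating with a
    // chunking adapter; per the commit message this generated slightly
    // better code.
    let mut k = 0;
    while k + 4 <= depth {
        // Load rows k..k+4 of B, interleave into transposed tiles, and
        // issue the dot product instructions here.
        k += 4;
    }
    // Rows k..depth form the tail, handled separately.
}
```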
robertknight force-pushed the faster-gemv-int8-kernel branch from 523330a to 1b26bb0 on January 29, 2025 at 21:08
robertknight marked this pull request as ready for review on January 29, 2025 at 21:11
robertknight merged commit b89f8b0 into main on Jan 29, 2025
2 checks passed
robertknight deleted the faster-gemv-int8-kernel branch on January 29, 2025 at 21:13