
Optimize non-transposed int8 GEMV kernel #559

Merged: robertknight merged 2 commits into main from faster-gemv-int8-kernel on Jan 29, 2025
Conversation

robertknight (Owner) commented on Jan 29, 2025

Previously the non-transposed int8 GEMV kernel iterated over the B matrix in blocks of I32_VEC_LEN columns by 4 rows. At each step it loaded I32_VEC_LEN int8 values and sign-extended them to i32. These were interleaved to give a 4x4 transposed tile of B, and two dot product instructions were used to update I32_VEC_LEN dot products and column sums.

This version reduces the number of loads from B by loading 4 rows of I8_VEC_LEN columns at a time, interleaving them to give I32_VEC_LEN 4x4 transposed tiles, and using I32_VEC_LEN * 2 dot product instructions to update I8_VEC_LEN dot products and column sums.
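As a rough illustration of the new layout, here is a scalar model of the interleave for I8_VEC_LEN = 16 and I32_VEC_LEN = 4 (not the kernel's actual SIMD code; the function name and array shapes are illustrative only):

```rust
/// Scalar model: take 4 rows x 16 int8 columns of B and produce 4
/// transposed 4x4 tiles. Each 4-byte group packs one column's 4
/// consecutive-row values, which is the operand layout a 4-way int8
/// dot product instruction (e.g. NEON `sdot`) expects.
fn interleave_b_rows(rows: [[i8; 16]; 4]) -> [[[i8; 4]; 4]; 4] {
    let mut tiles = [[[0i8; 4]; 4]; 4];
    for col in 0..16 {
        for row in 0..4 {
            // Tile col / 4, lane col % 4, byte `row` holds B[row][col].
            tiles[col / 4][col % 4][row] = rows[row][col];
        }
    }
    tiles
}
```

With this layout, one dot product instruction per tile lane can accumulate a[k..k+4] · B[k..k+4, col] into each column's i32 accumulator.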

On an M3 Pro (5 performance cores) this improves GEMV performance for the non-transposed case from 45 GFLOPS to 61 GFLOPS, compared with 90 GFLOPS when B is transposed. For wasmtime (single core) it goes from 6.8 to 16.6 GFLOPS, compared with 28 GFLOPS when transposed.

A downside of this change is that the maximum number of tail columns is now much larger (up to 15, 31, or 63 columns, the last on AVX-512). This could possibly be handled by using masked loads for the tail.
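One possible shape for the masked-load idea, sketched in scalar form (the helper name is hypothetical): copy the valid tail columns into a zeroed buffer so the full-width vector path can run unchanged, with the zero padding contributing nothing to the dot products or column sums.

```rust
/// Hypothetical helper: emulate a masked load of up to LEN int8 values
/// by zero-padding, so a partial tail of columns can reuse the
/// full-width kernel path.
fn masked_load_i8<const LEN: usize>(src: &[i8]) -> [i8; LEN] {
    let mut buf = [0i8; LEN];
    let n = src.len().min(LEN);
    buf[..n].copy_from_slice(&src[..n]);
    buf
}
```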

  • Add SimdInt::{zip_lo_i8, zip_hi_i8, zip_lo_i16, zip_hi_i16} methods for interleaving packed 8- and 16-bit ints (see the sketch after this list)
  • Use these methods to implement a more efficient GEMV as described above
  • Fix the x86 build for macOS, which was broken in an earlier commit (and is not tested in CI :( )
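A scalar model of the zip semantics assumed here for the i8 variants, following NEON's `zip1`/`zip2` (the i16 variants are analogous; the 16-byte vector width is illustrative):

```rust
/// Interleave the low halves of two 16-byte vectors: [a0, b0, a1, b1, ...].
fn zip_lo_i8(a: [i8; 16], b: [i8; 16]) -> [i8; 16] {
    let mut out = [0i8; 16];
    for i in 0..8 {
        out[2 * i] = a[i];
        out[2 * i + 1] = b[i];
    }
    out
}

/// Interleave the high halves: [a8, b8, a9, b9, ...].
fn zip_hi_i8(a: [i8; 16], b: [i8; 16]) -> [i8; 16] {
    let mut out = [0i8; 16];
    for i in 0..8 {
        out[2 * i] = a[8 + i];
        out[2 * i + 1] = b[8 + i];
    }
    out
}
```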

TODO:

  • Fix the issue with zip on x64 AVX2, where the underlying unpack operates on 128-bit blocks instead of the whole 256-bit vector (illustrated after this list)
  • Test performance impact on non-Arm platforms
  • Fix the same x64 zip-method issue for AVX-512
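The AVX2 wrinkle is that `_mm256_unpacklo_epi8` and friends interleave within each 128-bit lane independently rather than across the full 256-bit vector. A scalar model of that behaviour:

```rust
/// Scalar model of AVX2 `_mm256_unpacklo_epi8`: the low 8 bytes of each
/// 128-bit lane are interleaved separately, so the result is not a
/// single 256-bit-wide zip of the two inputs.
fn avx2_unpacklo_epi8(a: [i8; 32], b: [i8; 32]) -> [i8; 32] {
    let mut out = [0i8; 32];
    for lane in 0..2 {
        let base = 16 * lane;
        for i in 0..8 {
            out[base + 2 * i] = a[base + i];
            out[base + 2 * i + 1] = b[base + i];
        }
    }
    out
}
```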

robertknight force-pushed the faster-gemv-int8-kernel branch 5 times, most recently from 01edc6b to 523330a, on January 29, 2025 at 20:53
Instead of loading 4 rows of 4 elements from B at a time, load 4 rows of 16
elements and interleave to give 4 x 4x4 transposed tiles.

The inner loop over K was also changed to manually increment `k` instead of
using `range_chunks_exact` as this generated slightly better code.

On an M3 Pro this improved the `bench_gemm_mix` gemv benchmark for int8
from ~45 GFLOPS to ~60 GFLOPS for the non-transposed case.
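
A minimal sketch of the inner-loop shape described in the commit message, assuming `depth` is the K dimension (names here are illustrative, and `range_chunks_exact` is a helper in this repo whose exact signature is not shown):

```rust
fn inner_loop(depth: usize) {
    // Step `k` manually in blocks of 4 rows instead of iterating with a
    // chunking adapter; per the commit message this generated slightly
    // better code.
    let mut k = 0;
    while k + 4 <= depth {
        // Load rows k..k+4 of B, interleave into transposed tiles, and
        // issue the dot product instructions here.
        k += 4;
    }
    // Rows k..depth form the tail, handled separately.
}
```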
robertknight force-pushed the faster-gemv-int8-kernel branch from 523330a to 1b26bb0 on January 29, 2025 at 21:08
robertknight marked this pull request as ready for review on January 29, 2025 at 21:11
robertknight merged commit b89f8b0 into main on Jan 29, 2025
2 checks passed
robertknight deleted the faster-gemv-int8-kernel branch on January 29, 2025 at 21:13