Improve Block pointer GEMM performance #3328

alexbaden · 2025-02-01T18:21:24Z

Comparing a simple GEMM kernel using block pointers with PyTorch (oneDNN) on PVC 1100:

| Kernel | gbps | tflops | Triton % of Torch tflops |
| --- | --- | --- | --- |
| Torch A x B | 220.7031745 | 136.4162522 | |
| Triton A x B | 174.9541537 | 108.8349622 | 79.78% |
| Torch A x B^T | 203.4142925 | 126.5652739 | |
| Triton A x B^T | 104.0757365 | 64.59003281 | 51.03% |
| Torch A^T x B | 218.4317726 | 135.8922436 | |
| Triton A^T x B | 42.5237584 | 26.39050218 | 19.42% |
| Torch A^T x B^T | 150.3182902 | 93.28844194 | |
| Triton A^T x B^T | 34.43304927 | 21.36935906 | 22.91% |

(captured 2025 Feb 1)
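For reference, the percentage column is the Triton tflops divided by the Torch tflops for the same layout. A minimal sketch of that arithmetic, using the numbers from the table; the `results` dict and the `triton_pct_of_torch` helper are names made up for illustration and are not part of the benchmark script:

```python
# tflops values copied from the table above, as (Torch, Triton) pairs.
# The dict layout and helper name are illustrative assumptions.
results = {
    "A x B": (136.4162522, 108.8349622),
    "A x B^T": (126.5652739, 64.59003281),
    "A^T x B": (135.8922436, 26.39050218),
    "A^T x B^T": (93.28844194, 21.36935906),
}


def triton_pct_of_torch(torch_tflops, triton_tflops):
    """Return Triton throughput as a percentage of Torch throughput."""
    return 100.0 * triton_tflops / torch_tflops


percentages = {
    name: triton_pct_of_torch(torch_tflops, triton_tflops)
    for name, (torch_tflops, triton_tflops) in results.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
```

Sorting `percentages` by value gives a rough ordering of which layout has the largest gap to close; in the numbers above, the two A^T cases are furthest behind.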

LIBIGC1_VERSION=2.5.11-1077
LEVEL_ZERO_VERSION=1.19.2.0-1076
AGAMA_VERSION=1077
GPU_DEVICE=Intel(R) Data Center GPU Max 1100
TORCH_VERSION=2.7.0
COMPILER_VERSION=2025.0.4

Umbrella issue to track improvements for each of the four individual GEMM kernels above.

alexbaden added performance umbrella labels Feb 1, 2025

alexbaden self-assigned this Feb 1, 2025
