
Performance suboptimal for very small matrices #857

Open
xor2k opened this issue Feb 15, 2025 · 6 comments

Comments

@xor2k

xor2k commented Feb 15, 2025

Dear BLIS Team,

I'm currently working on improving NumPy's matmul for the strided case, and I ran a large grid search with different BLAS frameworks; see

numpy/numpy#23752 (comment)

Here is a repost of the plots:

[Benchmark plots: blas_benchmark_v2.pdf]

The plots show the performance improvement of the respective BLAS framework (plus copying) over naïve matrix multiplication.

In the case of BLIS, there is a red hyperbola (performance degradation instead of speedup) for very small matrices, and it is more pronounced than in other frameworks on the same machine, e.g. OpenBLAS, or AOCL with its small triangular area. Maybe there is still some room for improvement. I can run more benchmarks, produce more plots like these, and provide some code if there is interest.

Best from Berlin, Michael

@devinamatthews
Member

A lot of that is probably just from function call overhead and a little bit of dynamic memory allocation. The AOCL version of BLIS includes a special codepath for very small (squarish) matrices, which is the strange triangle you see. BLIS does have a mechanism for optimizing multiplication of small matrices (intended for rectangular "tall and skinny" matrices); the threshold for when this turns on is fairly large (~250), but I would think something like the very-small-matrix optimization in AOCL should be possible.

Long story short, BLIS really has no optimizations for matrices like 10x10 or smaller and can be considered to have a number of anti-optimizations due to the way it is designed. This was a choice to provide the best performance over the widest range of matrices and functionality without huge developer burden.

@devinamatthews
Member

BTW I'm pretty sure that the veclib implementation doesn't switch to GPU at larger sizes but uses the Apple AMX accelerator which is on-die.

@devinamatthews
Member

Regarding batching, if this is a common use-case then that would be a possible avenue to increase performance as doing batched gemm properly (through the existing BLAS-like APIs) would naturally cut out a lot of the overhead I mentioned.
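
For reference, a minimal sketch of what a caller has to do today without a batched entry point, assuming the standard CBLAS interface (cblas_dgemm, as provided by the BLAS/CBLAS compatibility layer); the helper name dgemm_batch_naive and the pointer-array layout are just illustrative. Each iteration pays the full per-call overhead, which a proper batched API could amortize inside the library:

```c
#include <cblas.h>

/* Illustrative helper: C[i] = A[i] * B[i] for a batch of independent
 * m x k by k x n column-major matrices. Every iteration goes through
 * the full gemm front end; a batched API would hoist that work out. */
static void dgemm_batch_naive(int batch, int m, int n, int k,
                              const double *const *A, const double *const *B,
                              double *const *C)
{
    for (int i = 0; i < batch; ++i) {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0, A[i], m,   /* lda = m (column-major, no transpose) */
                         B[i], k,   /* ldb = k */
                    0.0, C[i], m);  /* ldc = m */
    }
}
```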

@xor2k
Author

xor2k commented Feb 15, 2025

Thanks for the reply, that's very interesting, especially that Apple has these undocumented AMX cores. I'll mention that in the original PR.

Would a simple check like if (m * p * n * n < some_threshold) return do_naive_gemm(...); really complicate the library so much, or is the case just too uninteresting?
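
To make that concrete, here is a rough sketch of the kind of fallback I mean (the names, the threshold value, and the cost check are made up for illustration and are not BLIS internals):

```c
/* Illustrative only: a hypothetical early exit at the top of a gemm driver. */
static void naive_dgemm(int m, int n, int k,
                        const double *A, int lda,
                        const double *B, int ldb,
                        double *C, int ldc)
{
    /* Plain triple loop, column-major, C = A*B (alpha = 1, beta = 0). */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = acc;
        }
}

enum { SMALL_GEMM_THRESHOLD = 1000 };  /* made-up value, would need tuning */

/* Hypothetical guard: handle tiny problems with the naive loop and skip
 * the blocked/packed path entirely. Returns 1 if it handled the call. */
static int try_small_dgemm(int m, int n, int k,
                           const double *A, int lda,
                           const double *B, int ldb,
                           double *C, int ldc)
{
    if ((long)m * n * k >= SMALL_GEMM_THRESHOLD)
        return 0;  /* let the regular optimized path take over */
    naive_dgemm(m, n, k, A, lda, B, ldb, C, ldc);
    return 1;
}
```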

Yeah, from my point of view, gemm_batch would be interesting, though I'm not a Numpy maintainer, just somebody who worked on that PR. Is there a way to portably and properly access those BLAS-like functions at runtime? The name gemm_batch at least seems to be shared among OpenBLAS and MKL.

How long does it take and how complicated is it to add something to the BLAS standard? Is that even a thing? I changed the core statistical methods of a standard myself; it took many years and an expert's opinion, but I think BLAS is on a completely different level 😅

@leekillough
Collaborator

@xor2k:

There is no "official" BLAS standard à la ISO; it is mostly a set of de facto standards that have developed since BLAS Level 1 was originally proposed and later standardized in the 1970s, when vector supercomputers were dominant in HPC.

Later, with the decline of vector machines and the rise of superscalar RISC, BLAS Level 2 and Level 3 were proposed.

LINPACK, LAPACK, and other math packages were the main users of BLAS and the main drivers of its APIs.

BLIS is a superset of what is often termed "Legacy BLAS" or "Reference BLAS", the original Fortran 77 APIs of BLAS. This legacy BLAS has had many forks and proposals, but none has been widely adopted beyond the original Legacy BLAS 1/2/3 APIs.

If you have ideas for a new BLIS routine, or new ways to make BLIS run faster on a subset of problems like small sizes, then you can propose one, but keep in mind that there are certain conventions in BLIS like runtime contexts and target machine configurations that new routines will need to fit in. Making a change across all of BLIS regardless of target machine will be scrutinized more heavily than making changes for a specific target machine. The high-level BLIS APIs cannot be changed as easily as the low-level implementation of a specific kernel, possibly for a specific target machine.

As others said, the performance of BLIS on very small problem sizes has been sacrificed to get higher performance across a wider range of medium-to-large problem sizes, to increase portability and programmer productivity, and to ease adoption on new architectures.

I have not looked at your problem type in detail, but if it's very small matrices (like 10x10), then automatically generating and/or auto-tuning specialized kernels for these problems would probably be better than trying to change BLIS to accommodate them. There are a lot of projects, in ML for example, which automatically generate optimized assembly code for small kernels based on problem size, e.g. with a Python script.
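
As a rough illustration only (the sizes and names here are hypothetical, not from any particular generator), a kernel with dimensions fixed at compile time lets the compiler fully unroll and vectorize, with no runtime dispatch or packing:

```c
/* Hypothetical generated kernel: C += A*B for a fixed 4x4x4 problem,
 * column-major. With all sizes known at compile time the compiler can
 * unroll and vectorize the loops completely. */
static void dgemm_4x4x4(const double *restrict A,
                        const double *restrict B,
                        double *restrict C)
{
    enum { M = 4, N = 4, K = 4 };
    for (int j = 0; j < N; ++j)
        for (int p = 0; p < K; ++p) {
            const double b = B[p + j * K];
            for (int i = 0; i < M; ++i)
                C[i + j * M] += A[i + p * M] * b;
        }
}
```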

@jeffhammond
Member

Batched BLAS on CPU isn't necessary. An OpenMP loop over LIBXSMM is going to be close to optimal. The key point is that LIBXSMM optimizes for small sizes much better than other libraries.
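
Roughly something like the sketch below, assuming LIBXSMM's BLAS-like libxsmm_dgemm wrapper from libxsmm.h (the surrounding function, names, and data layout are placeholders, not a definitive recipe); LIBXSMM can JIT and cache kernels specialized for these small dimensions, so each thread ends up calling a tight kernel rather than a general-purpose gemm front end:

```c
#include <libxsmm.h>

/* Sketch: multiply a batch of independent small column-major matrices
 * in parallel, C[i] = A[i] * B[i], using LIBXSMM's BLAS-like wrapper. */
void small_gemm_batch(int batch, libxsmm_blasint m, libxsmm_blasint n,
                      libxsmm_blasint k,
                      const double *const *A, const double *const *B,
                      double *const *C)
{
    const double alpha = 1.0, beta = 0.0;
    const char transa = 'N', transb = 'N';

    #pragma omp parallel for
    for (int i = 0; i < batch; ++i) {
        libxsmm_dgemm(&transa, &transb, &m, &n, &k,
                      &alpha, A[i], &m, B[i], &k,
                      &beta,  C[i], &m);
    }
}
```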
