
Performance suboptimal for very small matrices #857

Open
xor2k opened this issue Feb 15, 2025 · 6 comments

Comments

@xor2k

xor2k commented Feb 15, 2025

Dear BLIS Team,

I'm currently working on improving NumPy's matmul for the strided case, and I ran a large grid search with different BLAS frameworks; see

numpy/numpy#23752 (comment)

Here is a repost of the plots:

[Benchmark plots: blas_benchmark_v2.pdf]

The plots show the performance improvement of the respective BLAS framework (plus copying) over naïve matrix multiplication.

In the case of BLIS, there is a red hyperbola (performance degradation instead of speedup) for very small matrices, and it is more pronounced than in other frameworks on the same machine, e.g. OpenBLAS, or AOCL with its small triangular area. Maybe there is still some room for improvement. I can run more benchmarks, produce more plots like these, and provide some code if there is interest.

Best from Berlin, Michael

@devinamatthews
Member

A lot of that is probably just from function call overhead and a little bit of dynamic memory allocation. The AOCL version of BLIS includes a special codepath for very small (squarish) matrices, which is the strange triangle you see. BLIS does have a mechanism for optimizing multiplication of small matrices (intended for rectangular "tall and skinny" matrices); the threshold for when this turns on is fairly large (~250), but I would think something like the very-small-matrix optimization in AOCL should be possible.

Long story short, BLIS really has no optimizations for matrices like 10x10 or smaller and can be considered to have a number of anti-optimizations due to the way it is designed. This was a choice to provide the best performance over the widest range of matrices and functionality without huge developer burden.

@devinamatthews
Member

BTW I'm pretty sure that the veclib implementation doesn't switch to GPU at larger sizes but uses the Apple AMX accelerator which is on-die.

@devinamatthews
Member

Regarding batching, if this is a common use-case then that would be a possible avenue to increase performance as doing batched gemm properly (through the existing BLAS-like APIs) would naturally cut out a lot of the overhead I mentioned.
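
For reference, a minimal sketch of what a caller has to do today without a batched entry point, assuming the standard CBLAS interface (cblas_dgemm, as provided by the BLAS/CBLAS compatibility layer); the helper name dgemm_batch_naive and the pointer-array layout are just illustrative. Each iteration pays the full per-call overhead, which a proper batched API could amortize inside the library:

```c
#include <cblas.h>

/* Illustrative helper: C[i] = A[i] * B[i] for a batch of independent
 * m x k by k x n column-major matrices. Every iteration goes through
 * the full gemm front end; a batched API would hoist that work out. */
static void dgemm_batch_naive(int batch, int m, int n, int k,
                              const double *const *A, const double *const *B,
                              double *const *C)
{
    for (int i = 0; i < batch; ++i) {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0, A[i], m,   /* lda = m (column-major, no transpose) */
                         B[i], k,   /* ldb = k */
                    0.0, C[i], m);  /* ldc = m */
    }
}
```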

@xor2k
Author

xor2k commented Feb 15, 2025

Thanks for the reply, that's very interesting, especially that Apple has these undocumented AMX cores. I'll mention that in the original PR.

Would a simple check like if (m * p * n * n < some_threshold) return do_naive_gemm(...); really complicate the library so much, or is the case just too uninteresting?
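
To make that concrete, here is a rough sketch of the kind of fallback I mean (the names, the threshold value, and the cost check are made up for illustration and are not BLIS internals):

```c
/* Illustrative only: a hypothetical early exit at the top of a gemm driver. */
static void naive_dgemm(int m, int n, int k,
                        const double *A, int lda,
                        const double *B, int ldb,
                        double *C, int ldc)
{
    /* Plain triple loop, column-major, C = A*B (alpha = 1, beta = 0). */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = acc;
        }
}

enum { SMALL_GEMM_THRESHOLD = 1000 };  /* made-up value, would need tuning */

/* Hypothetical guard: handle tiny problems with the naive loop and skip
 * the blocked/packed path entirely. Returns 1 if it handled the call. */
static int try_small_dgemm(int m, int n, int k,
                           const double *A, int lda,
                           const double *B, int ldb,
                           double *C, int ldc)
{
    if ((long)m * n * k >= SMALL_GEMM_THRESHOLD)
        return 0;  /* let the regular optimized path take over */
    naive_dgemm(m, n, k, A, lda, B, ldb, C, ldc);
    return 1;
}
```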

Yeah, from my point of view, gemm_batch would be interesting, though I'm not a Numpy maintainer, just somebody who worked on that PR. Is there a way to portably and properly access those BLAS-like functions at runtime? The name gemm_batch at least seems to be shared among OpenBLAS and MKL.

How long does it take and how complicated is it to add something to the BLAS standard? Is that even a thing? I changed the core statistical methods of a standard myself; it took many years and an expert's opinion, but I think BLAS is on a completely different level 😅

@leekillough
Collaborator

@xor2k:

There is no "official" BLAS standard à la ISO; it is mostly a set of de facto standards that have developed since BLAS Level 1 was originally proposed and later standardized in the 1970s, when vector supercomputers were dominant in HPC.

Later, with the decline of vector machines and the rise of superscalar RISC, BLAS Level 2 and Level 3 were proposed.

LINPACK, LAPACK, and other math packages were the main users of BLAS and the main drivers of its APIs.

BLIS is a superset of what is often termed "Legacy BLAS" or "Reference BLAS", the original Fortran 77 APIs of BLAS. This legacy BLAS has had many forks and proposals, but none has been widely adopted beyond the original Legacy BLAS 1/2/3 APIs.

If you have ideas for a new BLIS routine, or new ways to make BLIS run faster on a subset of problems like small sizes, then you can propose one, but keep in mind that there are certain conventions in BLIS like runtime contexts and target machine configurations that new routines will need to fit in. Making a change across all of BLIS regardless of target machine will be scrutinized more heavily than making changes for a specific target machine. The high-level BLIS APIs cannot be changed as easily as the low-level implementation of a specific kernel, possibly for a specific target machine.

As others said, the performance of BLIS on very small problem sizes has been sacrificed to get higher performance across a wider range of medium-to-large problem sizes, to increase portability and programmer productivity, and to ease adoption on new architectures.

I have not looked at your problem type in detail, but if it's very small matrices (like 10x10), then automatically generating and/or auto-tuning specialized kernels for these problems would probably be better than trying to change BLIS to accommodate them. There are a lot of projects, in ML for example, which automatically generate optimized assembly code for small kernels based on problem size, e.g. with a Python script.
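
As a rough illustration only (the sizes and names here are hypothetical, not from any particular generator), a kernel with dimensions fixed at compile time lets the compiler fully unroll and vectorize, with no runtime dispatch or packing:

```c
/* Hypothetical generated kernel: C += A*B for a fixed 4x4x4 problem,
 * column-major. With all sizes known at compile time the compiler can
 * unroll and vectorize the loops completely. */
static void dgemm_4x4x4(const double *restrict A,
                        const double *restrict B,
                        double *restrict C)
{
    enum { M = 4, N = 4, K = 4 };
    for (int j = 0; j < N; ++j)
        for (int p = 0; p < K; ++p) {
            const double b = B[p + j * K];
            for (int i = 0; i < M; ++i)
                C[i + j * M] += A[i + p * M] * b;
        }
}
```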

@jeffhammond
Member

Batched BLAS on CPU isn't necessary. An OpenMP loop over LIBXSMM is going to be close to optimal. The key point is that LIBXSMM optimizes for small sizes much better than other libraries.
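
Roughly something like the sketch below, assuming LIBXSMM's BLAS-like libxsmm_dgemm wrapper from libxsmm.h (the surrounding function, names, and data layout are placeholders, not a definitive recipe); LIBXSMM can JIT and cache kernels specialized for these small dimensions, so each thread ends up calling a tight kernel rather than a general-purpose gemm front end:

```c
#include <libxsmm.h>

/* Sketch: multiply a batch of independent small column-major matrices
 * in parallel, C[i] = A[i] * B[i], using LIBXSMM's BLAS-like wrapper. */
void small_gemm_batch(int batch, libxsmm_blasint m, libxsmm_blasint n,
                      libxsmm_blasint k,
                      const double *const *A, const double *const *B,
                      double *const *C)
{
    const double alpha = 1.0, beta = 0.0;
    const char transa = 'N', transb = 'N';

    #pragma omp parallel for
    for (int i = 0; i < batch; ++i) {
        libxsmm_dgemm(&transa, &transb, &m, &n, &k,
                      &alpha, A[i], &m, B[i], &k,
                      &beta,  C[i], &m);
    }
}
```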
