Performance suboptimal for very small matrices #857
A lot of that is probably just function call overhead and a little bit of dynamic memory allocation. The AOCL version of BLIS includes a special code path for very small (squarish) matrices, which is the strange triangle you see. BLIS does have a mechanism for optimizing multiplication of small matrices (intended for rectangular "tall and skinny" matrices); the threshold at which this kicks in is fairly large (~250), but I would think there could be something analogous to the very small-matrix optimization in AOCL. Long story short, BLIS really has no optimizations for matrices around 10x10 or smaller, and can be considered to have a number of anti-optimizations due to the way it is designed. This was a choice to provide the best performance over the widest range of matrices and functionality without huge developer burden.
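As a rough illustration of where that fixed per-call cost bites, a micro-benchmark sketch like the following compares a CBLAS call against an inlined naive loop at a very small size (the size, repetition count, and timing method are arbitrary choices here, and a serious benchmark would need stronger guards against the compiler collapsing the naive loop):

```c
#include <cblas.h>
#include <stdio.h>
#include <time.h>

#define N    8        /* "very small" square size */
#define REPS 100000   /* arbitrary repetition count */

static double A[N*N], B[N*N], C[N*N];

/* naive kernel: no library call, no packing, no allocation */
static void naive_dgemm(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i*N + k] * B[k*N + j];
            C[i*N + j] = s;
        }
}

int main(void)
{
    for (int i = 0; i < N*N; i++) { A[i] = 1.0; B[i] = 2.0; }

    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++) naive_dgemm();

    clock_t t1 = clock();
    for (int r = 0; r < REPS; r++)
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, A, N, B, N, 0.0, C, N);
    clock_t t2 = clock();

    /* print a result so the stores stay observable */
    printf("C[0] = %g, naive: %.3fs, cblas: %.3fs\n", C[0],
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
```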
BTW I'm pretty sure that the veclib implementation doesn't switch to the GPU at larger sizes, but uses the Apple AMX accelerator, which is on-die.
Regarding batching: if this is a common use case, then that would be a possible avenue to increase performance, since doing batched gemm properly (through the existing BLAS-like APIs) would naturally cut out a lot of the overhead I mentioned.
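For instance, a caller-side version of batching over a contiguous 3-D stack might look like the sketch below (the function name, row-major layout, and packed strides are assumptions for illustration, not an existing BLIS API):

```c
#include <stddef.h>
#include <cblas.h>

/* Loop over a packed, row-major 3-D stack (batch x m x k, etc.), the shape
 * NumPy's strided matmul case produces. A true gemm_batch entry point
 * could hoist per-call setup (threading decisions, packing buffers)
 * out of this loop; here it is paid on every iteration. */
void batched_dgemm(int batch, int m, int n, int k,
                   const double *A, const double *B, double *C)
{
    for (int b = 0; b < batch; b++)
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0,
                    A + (size_t)b * m * k, k,
                    B + (size_t)b * k * n, n,
                    0.0,
                    C + (size_t)b * m * n, n);
}
```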
Thanks for the reply, that's very interesting, especially that Apple has these undocumented AMX cores; I'll mention that in the original PR. Would a simple check like […]?

Yeah, from my point of view, gemm_batch would be interesting, though I'm not a Numpy maintainer, just somebody who worked on that PR. Is there a way to portably and properly access those BLAS-like functions at runtime? The name […]

How long does it take, and how complicated is it, to add something to the BLAS standard? Is that even a thing? I changed the core statistical methods of a standard myself; it took many years and an expert's opinion, but I think BLAS is a completely different level 😅
There is no "official" BLAS standard à la ISO; it is mostly a set of de facto standards that have developed since the Level-1 BLAS was originally proposed and later standardized in the 1970s, when vector supercomputers were dominant in HPC. Later, with the decline of vector machines and the rise of superscalar RISC, the Level-2 and Level-3 BLAS were proposed. LINPACK, LAPACK, and other math packages were the main users of BLAS and the main drivers of its APIs. BLIS is a superset of what is often termed "Legacy BLAS" or "Reference BLAS", the original Fortran 77 APIs of BLAS. This legacy BLAS has had many forks and proposed extensions, but none has been widely adopted in place of the original Level-1/2/3 APIs.

If you have ideas for a new BLIS routine, or new ways to make BLIS run faster on a subset of problems like small sizes, then you can propose one, but keep in mind that there are certain conventions in BLIS, like runtime contexts and target machine configurations, that new routines will need to fit into. Making a change across all of BLIS regardless of target machine will be scrutinized more heavily than making changes for a specific target machine. The high-level BLIS APIs cannot be changed as easily as the low-level implementation of a specific kernel, possibly for a specific target machine.

As others said, the performance of BLIS on very small problem sizes has been sacrificed to get higher performance across a wider range of medium-to-large problem sizes, to increase portability and programmer productivity, and to ease adoption on new architectures. I have not looked in detail at your problem type, but if it involves very small matrices (like 10x10), then automatically generating and/or auto-tuning specialized kernels for these problems would probably be better than trying to change BLIS to accommodate them. There are a lot of projects in ML, for example, which automatically generate optimized assembly code for small kernels based on problem size, such as with a Python script.
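To give a flavor of that approach, here is a plain-C stand-in for what such a generator might emit for one fixed problem size; the name and sizes are made up, and real generators usually produce intrinsics or assembly tuned to the target ISA:

```c
/* Generator-style output for one fixed size: 4x4x4, row-major, fixed
 * leading dimensions, k-loop unrolled, no branches, no packing, no
 * allocation. Illustrative only. */
static inline void dgemm_4x4x4(const double *restrict a,
                               const double *restrict b,
                               double *restrict c)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            c[i*4 + j] += a[i*4 + 0] * b[0*4 + j]
                        + a[i*4 + 1] * b[1*4 + j]
                        + a[i*4 + 2] * b[2*4 + j]
                        + a[i*4 + 3] * b[3*4 + j];
}
```

Because every trip count is a compile-time constant, the compiler can fully unroll and vectorize this, and there is nothing to dispatch, check, or allocate at run time.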
Batched BLAS on CPU isn't necessary: an OpenMP loop over LIBXSMM is going to be close to optimal. The key point is that LIBXSMM optimizes for small sizes much better than other libraries do.
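A minimal sketch of that pattern, assuming the classic LIBXSMM dispatch API (libxsmm_dmmdispatch) and packed, equally-sized batch entries; the function name and strides here are illustrative:

```c
#include <stddef.h>
#include <libxsmm.h>

/* Dispatch a kernel JIT-specialized for the exact (m, n, k), then share
 * it across an OpenMP loop. NULL arguments pick LIBXSMM's defaults:
 * column-major, tight leading dimensions, alpha = 1, beta = 1,
 * so each call accumulates C := C + A * B. */
void batched_smm(int batch, int m, int n, int k,
                 const double *A, const double *B, double *C)
{
    libxsmm_dmmfunction kernel = libxsmm_dmmdispatch(m, n, k,
        NULL /*lda*/, NULL /*ldb*/, NULL /*ldc*/,
        NULL /*alpha*/, NULL /*beta*/, NULL /*flags*/, NULL /*prefetch*/);
    if (!kernel) return;  /* fall back to a regular BLAS call in real code */

    #pragma omp parallel for
    for (int b = 0; b < batch; b++)
        kernel(A + (size_t)b * m * k,
               B + (size_t)b * k * n,
               C + (size_t)b * m * n);
}
```

Since the kernel is specialized for the exact sizes up front, the per-iteration work inside the loop is just the multiply itself, which is what makes this approach hard to beat for small matrices.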
Dear BLIS Team,
I'm currently working on improving Numpy's `matmul` for the strided case, and I ran a large grid search with different BLAS frameworks; see numpy/numpy#23752 (comment). Here is a repost of the plots:

blas_benchmark_v2.pdf

The plots show the performance improvement of the respective BLAS framework (plus copying) over naïve matrix multiplication. In the case of BLIS, there is a red hyperbola (performance degradation instead of speedup) for very small matrices, and it is more intense than in other frameworks on the same machine, e.g. OpenBLAS, or AOCL with its small triangular area. Maybe there is still some room for improvement. I can do more benchmarks and plots like this if there is interest, and also provide some code.
Best from Berlin, Michael