Fast KroneckerProduct.matmul, t_matmul and rmatmul #103

abhijangda · 2024-12-08T01:15:52Z

Hello linear_operator developers,

I have developed a library, FastKron (https://github.com/abhijangda/fastkron), for fast Matrix Kronecker-Matrix and Kronecker-Matrix Matrix multiplication. FastKron performs orders of magnitude faster (0.9x to 21x) than the current algorithm on both x86 CPUs and NVIDIA GPUs. The Python module, PyFastKron, provides a PyTorch interface with a backward pass. You can find more information at https://github.com/abhijangda/fastkron .

This PR integrates FastKron into KroneckerProductLinearOperator._matmul, KroneckerProductLinearOperator._t_matmul, and KroneckerProductLinearOperator.rmatmul. I look forward to your reviews and am happy to make any changes.

Thank You
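For readers unfamiliar with the two products named above, here is a minimal, dependency-free sketch of what a Kronecker-matrix product computes and why the dense Kronecker matrix never needs to be built. It uses the textbook approach of contracting one factor at a time. It is not FastKron's kernel and not the existing linear_operator code; the helper names (`kron`, `matmul`, `kron_matmul`, `matmul_kron`) are hypothetical and chosen only for illustration.

```python
from math import prod


def kron(a, b):
    """Dense Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def matmul(a, b):
    """Plain dense matrix product, used only as a reference."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def kron_matmul(factors, x):
    """Compute (F1 kron F2 kron ... kron FN) @ x one factor at a time.

    The rows of x are indexed by a mixed-radix tuple (j1, ..., jN); each pass
    replaces one input index j_i by the output index of factor F_i.
    """
    dims = [len(f[0]) for f in factors]  # current size of every index position
    k = len(x[0])
    z = [row[:] for row in x]
    for i, f in enumerate(factors):
        m, n = len(f), len(f[0])
        outer = prod(dims[:i])       # positions already contracted
        inner = prod(dims[i + 1:])   # positions still to contract
        out = [[0] * k for _ in range(outer * m * inner)]
        for o in range(outer):
            for a in range(m):
                for r in range(inner):
                    dst = out[(o * m + a) * inner + r]
                    for j in range(n):
                        coeff = f[a][j]
                        src = z[(o * n + j) * inner + r]
                        for c in range(k):
                            dst[c] += coeff * src[c]
        z = out
        dims[i] = m
    return z


def matmul_kron(x, factors):
    """Compute x @ (F1 kron ... kron FN) via the transpose identity."""
    flipped = [transpose(f) for f in factors]
    return transpose(kron_matmul(flipped, transpose(x)))


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[1, 0, 2]]  # a non-square 1 x 3 factor
X = [[i + 1, 2 * i] for i in range(12)]  # 12 = 2 * 2 * 3 rows
dense = kron(kron(A, B), C)
```

The factored routine only ever touches the small factors, which is the property that makes fast implementations possible. `matmul_kron` relies on the identity that the transpose of a Kronecker product is the Kronecker product of the transposed factors.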

abhijangda · 2024-12-08T18:08:43Z

It looks like the CI uses Python 3.8. PyFastKron is built for Python >= 3.9 because PyTorch requires >= 3.9. I can build PyFastKron for 3.8, but I think the ideal fix would be to upgrade Python in CI to >= 3.9. Let me know what you prefer.

Balandat · 2024-12-08T23:09:11Z

We should just upgrade to 3.9+ as py3.8 is EOL anyway.

cc @jandylin, @SebastianAment re the Kronecker library

abhijangda · 2025-02-03T00:15:23Z

Thanks for upgrading the Python version to 3.10. It looks like a workflow approval is needed to execute the CI tests. It would be great if you could approve it. I am happy to answer any questions about FastKron.

Balandat · 2025-02-04T14:16:22Z

> FastKron performs orders of magnitude faster (0.9x to 21x) than the current algorithm on both x86 CPUs and NVIDIA GPUs

Can you share the benchmarks that you ran for this?

abhijangda · 2025-02-04T18:24:30Z

Install pyfastkron using pip:

pip install -U pyfastkron

Clone the repository including submodules:

git clone --recurse-submodules https://github.com/abhijangda/fastkron.git

I also recommend installing TCMalloc, either with conda install conda-forge::gperftools or, on Ubuntu, with sudo apt install google-perftools libgoogle-perftools-dev. TCMalloc is significantly faster than the default glibc malloc that Python uses.
The choice between TCMalloc and glibc malloc does not matter for GPU performance, but on CPU TCMalloc removes the bottleneck caused by Python's GC.
The LD_PRELOAD path depends on how you installed TCMalloc.
For a conda installation: LD_PRELOAD=<anaconda-env-path>/lib/libtcmalloc.so
For an apt installation: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
These instructions are also available in the run_benchmarks.py script.

To evaluate Matrix Kronecker-Matrix (MKM) product (rmatmul in linear_operator) and Kronecker-Matrix Matrix (KMM) product (matmul in linear_operator) using Float and Double on CPU:

LD_PRELOAD=<LD-PRELOAD-PATH> TCMALLOC_RELEASE_RATE=0 python tests/benchmarks/run_benchmarks.py -backend x86 -types float double -dataset large -mmtype kmm mkm -use-pymodule

Similarly, for an NVIDIA GPU:

python tests/benchmarks/run_benchmarks.py -backend cuda -types float double -dataset large -mmtype kmm mkm -use-pymodule
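The two invocations above differ only in the backend and in the allocator environment, so they can be assembled by one small helper. The sketch below uses only the flags, paths, and environment variables quoted in this comment. The function name `benchmark_invocation` and the `tcmalloc_install` parameter are hypothetical, and `<anaconda-env-path>` is left as a placeholder to be filled in.

```python
# LD_PRELOAD paths as quoted in the instructions above.
# "<anaconda-env-path>" is a placeholder, not a real path.
TCMALLOC_PATHS = {
    "conda": "<anaconda-env-path>/lib/libtcmalloc.so",
    "apt": "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4",
}


def benchmark_invocation(backend, tcmalloc_install=None, dataset="large"):
    """Return (argv, extra_env) for one run of run_benchmarks.py.

    backend: "x86" or "cuda", as in the commands above.
    tcmalloc_install: "conda", "apt", or None to keep the default allocator.
    """
    argv = [
        "python", "tests/benchmarks/run_benchmarks.py",
        "-backend", backend,
        "-types", "float", "double",
        "-dataset", dataset,
        "-mmtype", "kmm", "mkm",
        "-use-pymodule",
    ]
    extra_env = {}
    # The allocator only matters on CPU, so preload TCMalloc for x86 runs only.
    if backend == "x86" and tcmalloc_install is not None:
        extra_env["LD_PRELOAD"] = TCMALLOC_PATHS[tcmalloc_install]
        extra_env["TCMALLOC_RELEASE_RATE"] = "0"
    return argv, extra_env


cpu_argv, cpu_env = benchmark_invocation("x86", tcmalloc_install="apt")
gpu_argv, gpu_env = benchmark_invocation("cuda")
```

To actually launch a run from the root of the cloned repository, the result could be passed to `subprocess.run(cpu_argv, env={**os.environ, **cpu_env}, check=True)`.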

The commands above use the large dataset, where the Kronecker matrices are large, for example a Kronecker matrix of 5 factors of size 8,8 or 2 factors of size 128,128. The large dataset takes a couple of hours on CPU and 1 hour on GPU.
The full dataset covers cases where the Kronecker factor sizes are multiples of 2. It takes a few tens of hours to run.
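To give a sense of scale for the two example shapes, the arithmetic below compares the storage for the factors with the storage a fully materialized Kronecker matrix would need. It assumes that "size 8,8" and "size 128,128" mean square 8 x 8 and 128 x 128 factors and that float is 4 bytes; the function names are hypothetical.

```python
FLOAT_BYTES = 4  # assumption: "float" is 32-bit


def dense_side(num_factors, factor_size):
    """Side length of the Kronecker product of square factors."""
    return factor_size ** num_factors


def dense_bytes(num_factors, factor_size, itemsize=FLOAT_BYTES):
    """Bytes needed to store the full Kronecker matrix."""
    return dense_side(num_factors, factor_size) ** 2 * itemsize


def factor_bytes(num_factors, factor_size, itemsize=FLOAT_BYTES):
    """Bytes needed to store only the factors."""
    return num_factors * factor_size ** 2 * itemsize


for n, p in [(5, 8), (2, 128)]:
    print(
        f"{n} factors of {p}x{p}: dense side {dense_side(n, p)}, "
        f"dense {dense_bytes(n, p) / 2**30:.0f} GiB, "
        f"factors {factor_bytes(n, p)} bytes"
    )
```

Under these assumptions, five 8 x 8 factors describe a 32768 x 32768 matrix (4 GiB in float) using 1280 bytes of factor data, and two 128 x 128 factors describe a 16384 x 16384 matrix (1 GiB in float) using 131072 bytes.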

abhijangda · 2025-02-06T22:21:38Z

Also, existing results for some benchmarks over GPyTorch on V100/A100 and AMD CPUs with AVX and AVX512 are here: https://github.com/abhijangda/FastKron/blob/main/documents/performance.md .

use pyfastkron for kroneckerproduct (t/r)matmul

05d8eb6

abhijangda force-pushed the main branch from 06f97e2 to 05d8eb6 Compare February 3, 2025 00:00

abhijangda added 2 commits February 3, 2025 20:25

fix for linter

8e62107

ufmt suggested changes

9c145cb
