Fast KroneckerProduct.matmul, t_matmul and rmatmul #103

Open
wants to merge 3 commits into
base: main


abhijangda

Hello linear_operator developers,

I have developed a library, FastKron (https://github.com/abhijangda/fastkron), for fast Matrix Kronecker-Matrix multiplication and Kronecker-Matrix Matrix multiplication. FastKron performs orders of magnitude faster (0.9x to 21x) than the current algorithm on both x86 CPUs and NVIDIA GPUs. The Python module, PyFastKron, provides a PyTorch interface with a backward pass. You can find more information at https://github.com/abhijangda/fastkron.

This PR integrates FastKron into KroneckerProductLinearOperator._matmul, KroneckerProductLinearOperator._t_matmul, and KroneckerProductLinearOperator.rmatmul. I look forward to your reviews and am happy to make any changes.

Thank you
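For readers unfamiliar with the operations named above, here is a minimal pure-Python sketch of the standard identity that lets a Kronecker-matrix product be computed without materializing the Kronecker product. It is not FastKron's algorithm or the linear_operator implementation; the function names (kron, matmul, kron_matmul, kron_rmatmul) are hypothetical and exist only for illustration, and matrices are assumed to be lists of rows.

```python
def kron(A, B):
    """Explicit Kronecker product of two matrices given as lists of rows."""
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]


def matmul(A, B):
    """Plain dense matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def kron_matmul(A, B, X):
    """Compute kron(A, B) @ X without forming kron(A, B).

    A is m x n, B is p x q, X is (n * q) x k; the result is (m * p) x k.
    """
    n, q, p, k = len(A[0]), len(B[0]), len(B), len(X[0])
    if len(X) != n * q:
        raise ValueError("X must have n * q rows")
    # Step 1: apply B to each of the n row-blocks of X (each block is q x k).
    blocks = [matmul(B, X[j * q:(j + 1) * q]) for j in range(n)]
    # Step 2: combine the p x k blocks using the entries of A.
    return [
        [sum(row_a[j] * blocks[j][r][c] for j in range(n)) for c in range(k)]
        for row_a in A
        for r in range(p)
    ]


def kron_rmatmul(X, A, B):
    """Compute X @ kron(A, B) via the transpose identity
    X @ kron(A, B) == (kron(A^T, B^T) @ X^T)^T."""
    return transpose(kron_matmul(transpose(A), transpose(B), transpose(X)))


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[0, 1], [1, 0]]
    X = [[1, 2], [3, 4], [5, 6], [7, 8]]
    # The factored result should match the explicit Kronecker product.
    print(kron_matmul(A, B, X) == matmul(kron(A, B), X))
```

The factored form touches only the small factors and X, which is why a dedicated kernel can avoid ever building the full Kronecker matrix.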

@abhijangda
Author

abhijangda commented Dec 8, 2024

It looks like the CI uses Python 3.8. PyFastKron is built for Python >= 3.9 because PyTorch requires >= 3.9. I can build PyFastKron for 3.8, but I think the ideal fix would be to upgrade Python in CI to >= 3.9. Let me know what you prefer.

@Balandat
Collaborator

Balandat commented Dec 8, 2024

We should just upgrade to 3.9+ as py3.8 is EOL anyway.

cc @jandylin, @SebastianAment re the Kronecker library

@abhijangda
Author

Thanks for upgrading the Python version to 3.10. It looks like a workflow approval is needed to run the CI tests. It would be great if you could approve it. I am happy to answer any questions about FastKron.

@Balandat
Collaborator

Balandat commented Feb 4, 2025

> FastKron performs orders of magnitude faster (0.9x to 21x) than current algorithm on both x86 CPUs and NVIDIA GPUs

Can you share the benchmarks that you ran for this?

@abhijangda
Author

abhijangda commented Feb 4, 2025

Install pyfastkron using pip:

pip install -U pyfastkron

Clone the repository including submodules:

git clone --recurse-submodules https://github.com/abhijangda/fastkron.git

I also recommend installing TCMalloc, either with conda install conda-forge::gperftools or, on Ubuntu, with sudo apt install google-perftools libgoogle-perftools-dev. TCMalloc is significantly faster than glibc malloc, the default allocator that Python uses.
The choice of allocator does not matter for GPU performance, but on CPU TCMalloc removes the bottleneck from Python's GC.
The LD_PRELOAD path depends on how you installed TCMalloc.
For conda installation: LD_PRELOAD=<anaconda-env-path>/lib/libtcmalloc.so
For apt installation: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
These instructions are also available in the run_benchmarks.py script.
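As a convenience, the two LD_PRELOAD locations listed above could be probed with a small helper like the sketch below. This helper is hypothetical and is not part of run_benchmarks.py; it assumes that <anaconda-env-path> corresponds to the CONDA_PREFIX environment variable that conda sets for an activated environment, and it only checks the two paths mentioned in this comment.

```python
import os
from pathlib import Path
from typing import Optional

# Path given above for the apt installation.
APT_TCMALLOC = Path("/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4")


def find_tcmalloc(conda_prefix: Optional[str] = None) -> Optional[Path]:
    """Return the first existing TCMalloc library path, or None.

    conda_prefix defaults to $CONDA_PREFIX (an assumption about how the
    conda environment path is discovered).
    """
    candidates = []
    prefix = conda_prefix or os.environ.get("CONDA_PREFIX")
    if prefix:
        # Path given above for the conda installation.
        candidates.append(Path(prefix) / "lib" / "libtcmalloc.so")
    candidates.append(APT_TCMALLOC)
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None


if __name__ == "__main__":
    path = find_tcmalloc()
    if path is None:
        print("TCMalloc not found at the expected locations")
    else:
        print(f"LD_PRELOAD={path}")
```

The printed value can then be used as <LD-PRELOAD-PATH> in the CPU benchmark command.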

To evaluate the Matrix Kronecker-Matrix (MKM) product (rmatmul in linear_operator) and the Kronecker-Matrix Matrix (KMM) product (matmul in linear_operator) using float and double on CPU:

LD_PRELOAD=<LD-PRELOAD-PATH> TCMALLOC_RELEASE_RATE=0 python tests/benchmarks/run_benchmarks.py -backend x86 -types float double -dataset large -mmtype kmm mkm -use-pymodule

Similarly, for an NVIDIA GPU:

python tests/benchmarks/run_benchmarks.py -backend cuda -types float double -dataset large -mmtype kmm mkm -use-pymodule

The commands above use the large dataset, in which the Kronecker matrices are large, for example a Kronecker matrix of 5 factors of size 8x8 or of 2 factors of size 128x128. The large dataset takes a couple of hours on CPU and about 1 hour on GPU.
The full dataset covers cases where the Kronecker factor sizes are multiples of 2. It takes a few tens of hours to run.
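To give a sense of scale for the two example configurations above, the following sketch computes the shape of the full Kronecker matrix and the memory it would need if stored densely, compared with storing only the factors. It assumes "size 8,8" and "size 128,128" mean square 8x8 and 128x128 factors; the helper names are hypothetical.

```python
import math


def kron_shape(factor_shapes):
    """Shape of the Kronecker product of factors with the given (rows, cols)."""
    rows = math.prod(r for r, _ in factor_shapes)
    cols = math.prod(c for _, c in factor_shapes)
    return rows, cols


def dense_bytes(shape, itemsize):
    """Bytes needed to store a dense matrix of the given shape."""
    return shape[0] * shape[1] * itemsize


def factor_bytes(factor_shapes, itemsize):
    """Bytes needed to store only the individual factors."""
    return sum(r * c for r, c in factor_shapes) * itemsize


if __name__ == "__main__":
    configs = {
        "5 factors of 8x8": [(8, 8)] * 5,
        "2 factors of 128x128": [(128, 128)] * 2,
    }
    for name, shapes in configs.items():
        full = kron_shape(shapes)
        for dtype, itemsize in (("float", 4), ("double", 8)):
            print(
                f"{name} ({dtype}): full matrix {full[0]}x{full[1]}, "
                f"dense {dense_bytes(full, itemsize) / 2**30:.0f} GiB, "
                f"factors {factor_bytes(shapes, itemsize)} bytes"
            )
```

Under these assumptions the dense matrices would occupy gigabytes while the factors occupy kilobytes, which is the motivation for never materializing the Kronecker product.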

@abhijangda
Author

Also, existing results for some benchmarks against GPyTorch on V100/A100 GPUs and on AMD CPUs with AVX and AVX512 are available here: https://github.com/abhijangda/FastKron/blob/main/documents/performance.md
