Add CUTLASS-based row-wise scaled sparse FP8 kernel #1671

Draft · alexsamardzic wants to merge 1 commit into main from rowwise-scaled-sparse-fp8-cutlass

Conversation

alexsamardzic (Collaborator)

No description provided.

pytorch-bot bot commented Feb 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1671

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit d1d96f7 with merge base d00ee41, 3 new CI job failures were reported.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Feb 5, 2025
alexsamardzic (Collaborator Author) commented Feb 5, 2025

The kernel is ready and passes a smoke test.

Remaining tasks:

  • Write a converter to the SM90 sparse semi-structured format
  • Validate the kernel on proper test inputs
  • Write the benchmark
  • Write the Python-side code: sparsify/quantize method, Llama generator extension, etc.
  • Ensure the kernel is built with SM90a flags when torchao detects an H100 card as SM90
  • Further unify the CUDA code with the rowwise_scaled_linear_cutlass code
  • Implement meaningful config selection

@cpuhrsch @drisspg

@cpuhrsch requested a review from jcaip on February 5, 2025
@alexsamardzic added the float8, sparsity, and topic: new feature labels on Feb 6, 2025
@alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch 2 times, most recently from 5bbcc49 to 6d34b7e, on February 6, 2025
@alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch 8 times, most recently from bd7288a to f11fae4, on February 13, 2025
@alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch 10 times, most recently from bf65c83 to c0368e3, on February 19, 2025
alexsamardzic (Collaborator Author)

Testing this PR revealed that the CUTLASS sparse compressor does not treat -0.0 values as zeros. An upstream fix is proposed here.
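
In the meantime, the dense-to-sparse conversion needs negative zeros canonicalized before compression. A minimal sketch of that idea (my illustration, not the exact workaround code in this PR), applied to the dense weight before the FP8 cast:

```python
import torch

def canonicalize_zeros(w: torch.Tensor) -> torch.Tensor:
    # IEEE 754 treats -0.0 == +0.0 as equal, so this rewrites every
    # negative zero in the pruned weight to a positive zero before the
    # tensor reaches the CUTLASS sparse compressor.
    return torch.where(w == 0, torch.zeros_like(w), w)

w = torch.tensor([1.0, -0.0, 0.0, -2.0])
# After canonicalization, no zero entry carries a sign bit anymore.
assert not torch.signbit(canonicalize_zeros(w)[w == 0]).any()
```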

@alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch from c0368e3 to 4c63c65, on February 20, 2025
@alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch 2 times, most recently from 05ed2d4 to 6fb6165, on February 23, 2025
alexsamardzic (Collaborator Author) commented Feb 24, 2025

This PR is ready for review. It contains:

  1. An implementation of two new CUTLASS-based operators:
    • A converter to the sparse semi-structured format, for FP8 data and the SM9x arch, in torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x.
    • A row-wise scaled linear operator for sparse FP8 weights and FP8 activations, in torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass. For parallel compilation, each operator template instantiation is in a separate .cu file.
  2. The test for the latter operator, in test/test_rowwise_scaled_linear_sparse_cutlass.py (not all tests pass at the moment because of [QST] About NaNs generated during FP16->FP8 quantization #1766), and the micro-benchmark, in benchmarks/benchmark_rowwise_scaled_linear_sparse_cutlass.py.
  3. The corresponding layout and TensorImpl class implementations in torchao/dtypes/floatx/cutlass_semi_sparse_layout.py. Because of a CUTLASS issue with handling negative zero values when compressing a dense tensor to sparse, the from_plain() method here contains a temporary workaround (the fix for this CUTLASS issue is in the works: Treat negative zero as equivalent to positive zero in sm90_sparse_gemm_compressor.hpp NVIDIA/cutlass#2110).
  4. The remaining glue code on the Python side, in torchao/ops.py, torchao/dtypes/affine_quantized_tensor.py and torchao/quantization/quant_api.py, including the definition of the new config Float8DynamicActivationFloat8SemiSparseWeightConfig for the quantize_() method (see the usage sketch after this list).
  5. An update to the torchao/_models/llama/generate.py script, to make it possible to test the new quantization and linear operator in the context of Llama - run with python generate.py --compile --sparsity semi -q float8dq.
  6. Some minor updates to the CUTLASS-based integer W4A4/W4A8 code.
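
For reference, a minimal usage sketch of the new config (my illustration, not code from this PR), assuming an SM90 GPU, a build that includes the new kernels, and the config's default arguments:

```python
import torch
from torchao.quantization import quantize_
from torchao.quantization.quant_api import (
    Float8DynamicActivationFloat8SemiSparseWeightConfig,
)

# Toy model; for the sparse kernel to apply, the linear weights must
# satisfy the 2:4 semi-structured sparsity pattern (e.g. after pruning).
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).half().cuda()

# Quantize activations dynamically and the semi-sparse weights to FP8,
# with row-wise scales, dispatching to the new CUTLASS operator.
quantize_(model, Float8DynamicActivationFloat8SemiSparseWeightConfig())

out = model(torch.randn(16, 4096, dtype=torch.half, device="cuda"))
```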

I'll address performance tuning (through CUTLASS run-time config selection), mentioned as a remaining task above, in a separate PR.

@drisspg The setup.py changes are about activating gencode flags for SM90a when the build targets SM90 (a sketch of the general idea follows below) - it's clumsy, but it works, so hopefully we can use this approach until eventually switching to CMake builds for the extensions. I'm adding you as a reviewer because of this; also, please add whoever may be the most appropriate reviewer(s) for the Python side of the code.
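
The general idea, sketched below under my own assumptions (this is not the exact setup.py diff), is to add the architecture-specific compute_90a/sm_90a gencode flags when an SM90 device is detected, since the FP8 sparse tensor-core paths require sm_90a:

```python
import torch

nvcc_extra_args = ["-O3"]
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (9, 0):
        # Plain sm_90 code cannot use the instructions (e.g. wgmma) that
        # these CUTLASS kernels rely on; the "a" variant exposes them.
        nvcc_extra_args.append("-gencode=arch=compute_90a,code=sm_90a")
```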

@jcaip If you think there is a need, we may eventually discuss exposing the new operators mentioned above through SparseSemiStructuredTensor.

@gau-nernst With this PR, it's possible to try the CUTLASS-based W4A4 operator from the Llama generator - run with python generate.py --compile --sparsity semi -q int4dq-4 (be sure to fetch the model beforehand; instructions are here). The output is not meaningful - maybe the quantization is too tight - but we may want to investigate further.

@alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch 3 times, most recently from ad04e4b to a7197f7, on February 25, 2025
@alexsamardzic force-pushed the rowwise-scaled-sparse-fp8-cutlass branch from a7197f7 to d1d96f7, on February 26, 2025