
PyTorch Custom Operator Integration #1544

Open · wants to merge 18 commits into main

Conversation

@matthewdouglas (Member) commented on Feb 27, 2025

Overview

This PR introduces the initial scaffolding to integrate PyTorch Custom Operators as the primary mechanism for dispatching to device-specific operator implementations.

As outlined in the related RFC #1545, the intent is that this will supersede the previous backend registration interface that was developed on the multi-backend-refactor branch. The baseline CUDA operators are established in this PR, and the implementation for additional backends is to be ported over to this new interface.

Why Custom Ops?

  • Registering operators with torch.library allows us to take advantage of the existing device dispatch mechanisms in PyTorch.
  • We can treat calls to functionality in our CUDA kernels, or other low-level backend implementations, as opaque for improved torch.compile support.
  • We can provide naive implementations of operators with only PyTorch code as a fallback option.
  • This helps to simplify development for additional backends, while taking an idiomatic, modern PyTorch approach (a minimal registration sketch follows this list).
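
As a minimal sketch of that registration pattern (the demo::scale_rows operator, its schema, and the fallback body below are hypothetical stand-ins; the PR registers its real operators under the bitsandbytes:: namespace):

import torch

# Hypothetical namespace and operator, used only to illustrate the dispatch pattern.
lib = torch.library.Library("demo", "DEF")
lib.define("scale_rows(Tensor A, Tensor stats) -> Tensor")

# Naive pure-PyTorch fallback, registered for the composite key so it is used on
# any device that has no specialized kernel.
def scale_rows_fallback(A: torch.Tensor, stats: torch.Tensor) -> torch.Tensor:
    return A * stats.unsqueeze(-1)

lib.impl("scale_rows", scale_rows_fallback, "CompositeExplicitAutograd")

# A backend-specific kernel (e.g. one calling into a CUDA extension) would be
# registered the same way under its dispatch key:
# lib.impl("scale_rows", scale_rows_cuda, "CUDA")

out = torch.ops.demo.scale_rows(torch.randn(4, 8), torch.randn(4))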

Operator Definitions

We broadly categorize operator functionality into three feature groups, though there can be some overlap.

LLM.int8()

Inference requirements

  • int8_vectorwise_quant(A: Tensor, threshold: float = 0.0) -> (Tensor, Tensor, Tensor?)
    • Implements the LLM.int8() quantization algorithm with the specified threshold.
    • Returns an int8 quantized tensor, a float32 tensor containing the scaling stats, and an optional int32 tensor containing a list of column indices with outliers present.
  • int8_linear_dequant(A: Tensor, B: Tensor, row_stats: Tensor, col_stats: Tensor, bias: Tensor?, dtype=torch.float16) -> Tensor (name may change)
    • By default, this is a composition of the two operators below; it can be implemented as one fused operator or as the two separately (see the composition sketch after this list).
      • int8_linear_matmul(A: Tensor, B: Tensor) -> Tensor
        • Performs an 8-bit integer matrix multiplication between two int8 matrices.
        • Returns an int32 matrix: A @ B.T
      • int8_mm_dequant(A: Tensor, row_stats: Tensor, col_stats: Tensor, dtype=torch.float16, bias: Tensor?) -> Tensor
        • Dequantizes the result of a quantized 8-bit matrix multiplication with an optional fused bias.
        • The result is returned in the specified dtype, which is always torch.float16 for the current CUDA implementation.
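
As a rough sketch, these operators compose for an int8 inference matmul as follows (assuming a torch.ops.bitsandbytes namespace and the signatures above; the fp16 decomposition of outlier columns above the threshold is omitted):

from typing import Optional

import torch

def int8_forward(x: torch.Tensor, weight: torch.Tensor, bias: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Row-wise absmax quantization of activations and weights to int8.
    x_q, x_stats, _outlier_cols = torch.ops.bitsandbytes.int8_vectorwise_quant(x, threshold=6.0)
    w_q, w_stats, _ = torch.ops.bitsandbytes.int8_vectorwise_quant(weight)

    # int32 accumulation of x_q @ w_q.T ...
    out_i32 = torch.ops.bitsandbytes.int8_linear_matmul(x_q, w_q)

    # ... then dequantize using both sets of stats, with an optional fused bias.
    return torch.ops.bitsandbytes.int8_mm_dequant(out_i32, x_stats, w_stats, dtype=torch.float16, bias=bias)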

Optional

  • int8_vectorwise_dequant(A: Tensor, stats: Tensor)
    • Dequantizes an int8 tensor that was quantized with int8_vectorwise_quant.
    • A default implementation in PyTorch is provided, which should work with any backend (a reference sketch follows this list).
    • This is a utility used by Transformers, Diffusers, PEFT, and others.
  • int8_double_quant(A: Tensor, threshold: float = 0.0)
    • Quantizes the input tensor using the LLM.int8() algorithm across both dimensions.
    • This is only useful for full int8 training (e.g. not LoRA), and as such, we only recommend implementing int8_vectorwise_quant.
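
A naive reference implementation of int8_vectorwise_dequant might look roughly like this (assuming stats holds per-row absmax values and the int8 values were scaled into [-127, 127]; the PR's actual fallback may differ):

import torch

def int8_vectorwise_dequant_ref(A: torch.Tensor, stats: torch.Tensor) -> torch.Tensor:
    # A: int8 tensor of shape (rows, cols); stats: float32 per-row absmax of shape (rows,)
    return A.to(torch.float32) * stats.view(-1, 1) / 127.0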

NF4/FP4

Minimal requirements

  • dequantize_4bit(A: Tensor, absmax: Tensor, blocksize: int, quant_type: Literal["nf4", "fp4"], shape: int[], dtype) -> Tensor
    • Dequantizes a packed 4bit tensor into the specified floating point dtype.
    • Note: Unlike bitsandbytes.functional.dequantize_4bit, this operator does not dequantize the absmax tensor. If double quantization is used, the absmax must first be dequantized with dequantize_blockwise.
  • quantize_4bit(A: Tensor, blocksize: int, quant_type: Literal["nf4", "fp4"], quant_storage=torch.uint8) -> (Tensor, Tensor)
    • Quantizes a floating point tensor into a packed 4bit tensor (see the round-trip sketch after this list).
    • Returns a tensor with the quantized data packed into bytes, backed by the specified storage type. The float32 absmax scaling factors are additionally returned.
    • Note: Unlike bitsandbytes.functional.quantize_4bit, this operator does not quantize the absmax tensor. If double quantization is used, the absmax must be quantized separately with quantize_blockwise.
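
A rough round-trip sketch (the torch.ops.bitsandbytes namespace and exact calling convention here are assumptions for illustration):

import torch

W = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Quantize: two 4-bit codes are packed per uint8 byte; absmax holds one float32 scale per block.
packed, absmax = torch.ops.bitsandbytes.quantize_4bit(
    W, blocksize=64, quant_type="nf4", quant_storage=torch.uint8
)

# Dequantize back to fp16; absmax is passed as-is (no nested quantization in this sketch).
W_restored = torch.ops.bitsandbytes.dequantize_4bit(
    packed, absmax, blocksize=64, quant_type="nf4", shape=list(W.shape), dtype=torch.float16
)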

Double quantization (aka compressed_statistics or nested)

  • dequantize_blockwise(A: Tensor, absmax: Tensor, code: Tensor, blocksize: int, dtype) -> Tensor
    • Dequantizes an 8bit tensor that was quantized with quantize_blockwise.
    • Returns the dequantized tensor in the specified dtype.
  • quantize_blockwise(A: Tensor, code: Tensor, blocksize: int) -> (Tensor, Tensor)
    • Quantizes into an 8bit blocked data type defined by code.
    • The blocksize will typically be 256 for usage with NF4/FP4 and optimizers.
    • Returns the quantized tensor in uint8 format, along with float32 absmax (a nested-quantization sketch follows this list).
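
To illustrate the nesting, the float32 absmax produced by quantize_4bit can itself be compressed and later restored (a sketch under the same namespace assumption as above; the codebook shown is a placeholder):

import torch

# Stand-in for the float32 absmax returned by quantize_4bit in the sketch above.
absmax = torch.rand(4096 * 4096 // 64, dtype=torch.float32, device="cuda")
code = torch.linspace(-1.0, 1.0, 256, device="cuda")  # placeholder codebook; the real one is a dynamic 8-bit map

# Double quantization: compress the float32 absmax blockwise down to uint8.
absmax_q, absmax_absmax = torch.ops.bitsandbytes.quantize_blockwise(absmax, code=code, blocksize=256)

# Restore the float32 absmax before handing it to dequantize_4bit (see the notes above).
absmax_fp32 = torch.ops.bitsandbytes.dequantize_blockwise(
    absmax_q, absmax_absmax, code=code, blocksize=256, dtype=torch.float32
)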

Optional

  • gemv_4bit
    • Fast path for bsz=1 inference with 4bit quantization. This operator is subject to some future revision.

Optimizers

Optimizer functionality will be implemented to support the custom operators in a future update.

@matthewdouglas added the high priority (first issues that will be worked on) and cross-platform labels on Feb 27, 2025

Comment on lines +19 to +22
torch.library.define(
    "bitsandbytes::int8_linear_dequant",
    "(Tensor A, Tensor B, Tensor row_stats, Tensor col_stats, Tensor? bias=None, ScalarType dtype=float16) -> Tensor",
)

Hey! I'm the main maintainer of custom operators in PyTorch. I'm curious -- why not use the torch.library.custom_op API instead of torch.library.define?

It would look something like:

from typing import Optional

import torch
from torch import Tensor

@torch.library.custom_op("bitsandbytes::int8_linear_dequant", mutates_args=())
def int8_linear_dequant(A: Tensor, B: Tensor, row_stats: Tensor, col_stats: Tensor, bias: Optional[Tensor] = None, dtype: torch.dtype = torch.float16) -> Tensor:
    raise NotImplementedError("")

@int8_linear_dequant.register_fake
def _(
    A: torch.Tensor,
    B: torch.Tensor,
    row_stats: torch.Tensor,
    col_stats: torch.Tensor,
    bias: Optional[torch.Tensor] = None,
    dtype: torch.dtype = torch.float16,
) -> torch.Tensor:
    # Fake (meta) implementation: computes only the output shape/dtype for tracing.
    shapeC = (*A.shape[:-1], B.shape[0])
    return torch.empty(shapeC, device=A.device, dtype=dtype)

We generally encourage people to use torch.library.custom_op because the custom ops produced from it are guarded against various footguns compared to torch.library.Library.define.

@matthewdouglas (Member, Author) commented:

Hey! Thanks for the feedback :)

While the custom_op API does look to be convenient, there are two main reasons it was avoided:

  1. I'm not sure we're ready to bump our minimum PyTorch requirement to 2.4.0+. That said, we're not strictly opposed to it.
  2. I've heard from some others that there was significant overhead introduced with the use of custom_op.

I am curious, is it still reasonable to make use of infer_schema, and is that API available in torch < 2.4?


Thanks for the feedback! It's not clear to me if we have fully fixed the performance issues, but I will check.
torch.library.infer_schema is only available in 2.5+, so if your goal is to support older PyTorch versions, you are doing the right thing.
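
For reference, the infer_schema path on torch >= 2.5 would look roughly like this (the exact schema string produced is an assumption):

from typing import Optional

import torch
from torch import Tensor

def int8_linear_dequant(
    A: Tensor,
    B: Tensor,
    row_stats: Tensor,
    col_stats: Tensor,
    bias: Optional[Tensor] = None,
    dtype: torch.dtype = torch.float16,
) -> Tensor: ...

# Derive the schema string from the annotations instead of writing it by hand.
schema = torch.library.infer_schema(int8_linear_dequant, mutates_args=())
# Expected to be roughly:
# "(Tensor A, Tensor B, Tensor row_stats, Tensor col_stats, Tensor? bias=None, ScalarType dtype=float16) -> Tensor"
torch.library.define("bitsandbytes::int8_linear_dequant", schema)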
