# 1.1.0

This release completes the implementation of our custom linear layer by adding a fully functional backward pass, enabling end-to-end training.
## What's New

- **Complete Backward Pass Implementation**: Added three specialized Triton kernels for efficient backpropagation:
  - `backward_input_kernel`: Computes gradients with respect to the input
  - `backward_weight_kernel`: Computes gradients with respect to the weights
  - `fused_relu_bias_backward_kernel`: Fused computation of the bias gradients and the ReLU backward pass (sketched after this list)
- **Performance Optimizations**:
  - Kernel fusion to minimize memory operations
  - Autotuned configurations for optimal performance
  - Block-based computation patterns for efficient GPU utilization
  - Mixed precision (float32 for computation, float16 for storage)
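For illustration, here is a minimal sketch of what a fused ReLU/bias backward kernel along these lines might look like. The signature, strides, block sizes, and autotuning configurations are assumptions made for the example, not the kernel shipped in this release.

```python
# Illustrative sketch only: signature, strides, and tile sizes are assumptions.
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 64}, num_warps=4),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64}, num_warps=8),
    ],
    key=["M", "N"],
    # grad_bias is accumulated atomically, so zero it between autotuning runs.
    reset_to_zero=["grad_bias_ptr"],
)
@triton.jit
def fused_relu_bias_backward_kernel(
    grad_out_ptr,   # (M, N) upstream gradient, float16
    act_ptr,        # (M, N) saved post-ReLU activations, float16
    grad_z_ptr,     # (M, N) out: gradient w.r.t. the pre-activation, float16
    grad_bias_ptr,  # (N,)   out: bias gradient, float32
    M, N,
    stride_m, stride_n,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
):
    # Each program instance owns one (BLOCK_M, BLOCK_N) tile of the matrices.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rows = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    cols = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    mask = (rows[:, None] < M) & (cols[None, :] < N)
    offs = rows[:, None] * stride_m + cols[None, :] * stride_n

    # Load float16 values, compute in float32 for numerical stability.
    g = tl.load(grad_out_ptr + offs, mask=mask, other=0.0).to(tl.float32)
    a = tl.load(act_ptr + offs, mask=mask, other=0.0).to(tl.float32)

    # ReLU backward: the gradient flows only where the activation was positive.
    gz = tl.where(a > 0.0, g, 0.0)

    # Store the pre-activation gradient back in float16.
    tl.store(grad_z_ptr + offs, gz.to(tl.float16), mask=mask)

    # Bias gradient: column-wise partial sum, accumulated across row tiles.
    partial = tl.sum(gz, axis=0)
    tl.atomic_add(grad_bias_ptr + cols, partial, mask=cols < N)
```

Fusing the ReLU mask and the bias reduction into one pass means the upstream gradient is read from global memory only once, which is the kind of saving the kernel-fusion bullet refers to.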
## Previous Release
- Forward pass implementation with fused linear transformation and ReLU activation
## Technical Details

The backward pass maintains the same performance philosophy as the forward pass (the sketch after this list shows how the three kernels map onto the gradient math):
- Leverages Triton for GPU acceleration
- Uses autotuning to optimize kernel configurations
- Implements efficient memory access patterns
- Maintains numerical stability through careful handling of data types
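To make the data flow concrete, the sketch below shows one way the three backward computations could be wired into a `torch.autograd.Function`. The structure and class name are assumptions, not the project's actual code, and the Triton kernel launches are replaced with equivalent `torch` expressions so the example stays self-contained.

```python
# Structural sketch only: the real backward would launch the Triton kernels
# instead of these equivalent torch expressions.
import torch


class TritonLinearReLUFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, bias):
        # Forward kernel: fused y = relu(x @ W^T + b).
        y = torch.relu(x @ weight.t() + bias)
        ctx.save_for_backward(x, weight, y)
        return y

    @staticmethod
    def backward(ctx, grad_output):
        x, weight, y = ctx.saved_tensors
        # fused_relu_bias_backward_kernel: mask the upstream gradient through
        # the ReLU and reduce it column-wise for the bias gradient.
        grad_z = grad_output * (y > 0).to(grad_output.dtype)
        grad_bias = grad_z.sum(dim=0)
        # backward_input_kernel: dL/dx = dL/dz @ W
        grad_input = grad_z @ weight
        # backward_weight_kernel: dL/dW = dL/dz^T @ x
        grad_weight = grad_z.t() @ x
        return grad_input, grad_weight, grad_bias
```

A module like `TritonLinear` would presumably call `TritonLinearReLUFunction.apply(x, self.weight, self.bias)` in its `forward`.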
## Usage

The layer can now be used as a drop-in replacement for `nn.Linear` in training scenarios:
```python
import torch
# TritonLinear is this project's custom layer;
# input_tensor, target, and criterion are assumed to be defined elsewhere.

layer = TritonLinear(in_features=512, out_features=256)
optimizer = torch.optim.Adam(layer.parameters())

# Forward pass
output = layer(input_tensor)
loss = criterion(output, target)

# Backward pass (now supported!)
loss.backward()
optimizer.step()
```
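As a quick sanity check (a sketch, not part of the release), the gradients can be compared against a plain PyTorch reference. This assumes `TritonLinear` exposes `nn.Linear`-style `weight` and `bias` parameters, fuses a ReLU into its forward pass as described above, and accepts float32 input while handling float16 storage internally.

```python
# Hypothetical gradient check against a pure PyTorch reference.
import torch
import torch.nn.functional as F

layer = TritonLinear(in_features=512, out_features=256).cuda()
x = torch.randn(32, 512, device="cuda", requires_grad=True)

out = layer(x)              # custom Triton forward
out.sum().backward()        # custom Triton backward
grad_custom = x.grad.detach().clone()

x.grad = None
ref = torch.relu(F.linear(x, layer.weight.float(), layer.bias.float()))
ref.sum().backward()        # pure PyTorch reference backward

# float16 storage in the kernels means only loose tolerances are expected.
torch.testing.assert_close(out.float(), ref, rtol=1e-2, atol=1e-2)
torch.testing.assert_close(grad_custom.float(), x.grad, rtol=1e-2, atol=1e-2)
```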