A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.
This project is a step-by-step learning journey where we implement various types of Triton kernels—from the simplest examples to more advanced applications—while exploring GPU programming with Triton. The goal of this repository is to help you (and others) get comfortable with Triton by:
- Starting simple: begin with basic kernels such as vector addition, and understand the building blocks of writing GPU code with Triton (see the minimal sketch after this list).
- Incremental learning: each day introduces a new challenge, progressively covering more complex topics, techniques, and optimizations.
- Hands-on experience: code, test, and benchmark your kernels against standard implementations (e.g., PyTorch) to see performance improvements and better understand GPU behavior.
- Daily challenges: every day, a new challenge is posted in this repository. Each challenge focuses on a specific aspect of Triton, such as:
- Basic operations (e.g., vector addition)
- Memory management and optimizations
- Advanced indexing and dynamic shapes
- Multi-dimensional kernels
- Reduction operations and more
- Detailed explanations: each kernel comes with an in-depth explanation of the code, helping you understand the concepts behind the implementation.
- Benchmarking and stress tests: learn how to measure performance by comparing custom Triton kernels with standard PyTorch implementations. Get hands-on experience with benchmarking on real-world GPU workloads.
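To make this concrete, here is a minimal sketch of a Triton vector-addition kernel in the style of the official tutorials. The names and the block size are illustrative, not the repository's exact solutions:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)  # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail when n_elements % BLOCK_SIZE != 0
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)  # one program per block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

The mask is the key idea here: it lets a fixed-size block safely handle vectors whose length is not a multiple of `BLOCK_SIZE`.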
Day | Kernel | Description |
---|---|---|
#1 | Constant add | This challenge is the first puzzle in our Daily Triton Challenge series. The goal is to write a Triton kernel that adds a constant value to each element of a vector. |
#2 | Add two vectors | A simple example of adding two vectors with a custom GPU kernel written in Triton, comparing the result to a standard PyTorch implementation. |
#3 | Add two vectors with speed benchmarking | Almost the same as #2, but we also measure kernel execution speed and compare it against the PyTorch implementation (see the benchmarking sketch below this table). |
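For challenge #3-style measurements, Triton ships a small benchmarking helper, `triton.testing.do_bench`. A minimal sketch, assuming the `add` wrapper from the example above is in scope and a CUDA device is available:

```python
import torch
import triton

# `add` is assumed to be the Triton wrapper from the vector-addition sketch above.
x = torch.rand(1 << 20, device="cuda")
y = torch.rand(1 << 20, device="cuda")

# do_bench runs the callable repeatedly (with warmup) and reports time in ms.
ms_triton = triton.testing.do_bench(lambda: add(x, y))
ms_torch = triton.testing.do_bench(lambda: x + y)
print(f"Triton: {ms_triton:.3f} ms | PyTorch: {ms_torch:.3f} ms")
```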
Gain deeper insights into Triton through these detailed articles:
- Understanding the Triton Tutorials Part 1 and Part 2
- Softmax in OpenAI Triton: a more detailed, step-by-step explanation of the Fused Softmax Triton example
- Accelerating AI with Triton: A Deep Dive into Writing High-Performance GPU Code
- Accelerating Triton Dequantization Kernels for GPTQ
- Triton Tutorial #2
- Triton: OpenAI’s Innovative Programming Language for Custom Deep-Learning Primitives
- Triton Kernel Compilation Stages
- Deep Dive into Triton Internals Part 1, Part 2 and Part 3
- Exploring Triton GPU programming for neural networks in Java
- Using User-Defined Triton Kernels with torch.compile
- Mamba: The Hard Way
- FP8: Accelerating 2D Dynamic Block Quantized Float8 GEMMs in Triton
- FP8: Deep Dive on CUTLASS Ping-Pong GEMM Kernel
- FP8: Deep Dive on the Hopper TMA Unit for FP8 GEMMs
- Technical Review on PyTorch 2.0 and Triton
- Towards Agile Development of Efficient Deep Learning Operators
- Developing Triton Kernels on AMD GPUs
- CUDA-Free Inference for LLMs
Explore the academic foundation of Triton:
Learn by watching these informative videos:
- Lecture 14: Practitioners Guide to Triton and notebook
- Lecture 29: Triton Internals
- Intro to Triton: Coding Softmax in PyTorch
- Triton Vector Addition Kernel, part 1: Making the Shift to Parallel Programming
- Tiled Matrix Multiplication in Triton - part 1
- Flash Attention derived and coded from first principles with Triton (Python)
Watch the Triton community meetups to stay up to date with recent Triton topics.
Challenge yourself with these engaging puzzles:
Enhance your Triton development workflow with these tools:
- Triton Deja-vu: a framework that reduces the autotune overhead of triton-lang to zero for well-known deployments. This small framework is based on the Triton autotuner and contributes two features to the Triton community: (1) storing and safely restoring autotuner states using JSON files, and (2) ConfigSpaces for exploring a defined configuration space exhaustively. It also lets you combine heuristics with the autotuner (a sketch of the stock autotuner it builds on follows this list).
- Triton Profiler (Proton), plus a video explaining how to use it: Dev Tools: Proton/Interpreter
- Triton-Viz: A Visualization Toolkit for Programming with Triton
- Make Triton easier - Triton-util provides simple higher-level abstractions for frequent but repetitive tasks. This allows you to write code that is closer to how you actually think.
- TritonBench is a collection of PyTorch operators used to evaluate the performance of Triton and its integration with PyTorch.
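For context on what Triton Deja-vu caches, this is roughly what the stock Triton autotuner it builds on looks like (this is not Deja-vu's own API, and the configs are illustrative):

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 256}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=8),
    ],
    key=["n_elements"],  # re-tune whenever this argument's value changes
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# At launch time the autotuner picks BLOCK_SIZE, so it is not passed explicitly:
# grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
# add_kernel[grid](x, y, out, n)
```

The stock autotuner re-benchmarks every config each time a `key` value changes within a process; Deja-vu's JSON store/restore avoids repeating that work across runs.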
Catch up on the latest advancements from Triton Conferences:
Explore practical implementations with these sample kernels:
- attorch is a subset of PyTorch's nn module, written purely in Python using OpenAI's Triton
- FlagGems is a high-performance general operator library implemented in OpenAI Triton. It aims to provide a suite of kernel functions to accelerate LLM training and inference.
- Kernl lets you run Pytorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
- Liger-Kernel
- Triton Kernels for Efficient Low-Bit Matrix Multiplication
- Unsloth Kernels
- An attempt at implementing a Triton kernel for GPTQ inference. This code is based on the GPTQ-for-LLaMa codebase, which is itself based on the GPTQ codebase.
- triton-index: a catalog of openly available Triton kernels
- Triton-based implementation of Sparse Mixture-of-Experts (SMoE) on GPUs
- Variety of Triton and CUDA kernels for training and inference
- EquiTriton is a project that seeks to implement high-performance kernels for commonly used building blocks in equivariant neural networks, enabling compute-efficient training and inference
- Expanded collection of neural network activation functions and other kernels written in OpenAI's Triton
- Fused kernels
- A feed-forward implementation using only Triton activation kernels
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance
- Bitsandbytes: a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8- and 4-bit quantization functions
- MInference Triton Kernels - FlashAttention
Kernel | Description | Resource |
---|---|---|
VectorAdd | A simple kernel that performs element-wise addition of two vectors. Useful for understanding the basics of GPU programming in Triton. | 1 2 |
Matmul | An optimized kernel for matrix multiplication, achieving high performance by leveraging memory hierarchy and parallelism. | 1 2 Grouped GEMM |
Softmax | A kernel for efficient computation of the softmax function, commonly used in machine learning models like transformers (a minimal sketch follows this table). | 1 2 3 |
Dropout | A kernel for implementing low-memory dropout, a regularization technique to prevent overfitting in neural networks. | 1 2 |
Layer Normalization | A kernel for layer normalization, which normalizes activations within a layer to improve training stability in deep learning models. | 1 2 3 |
Fused Attention | A kernel that efficiently implements attention mechanisms by combining multiple operations, key to transformers and similar architectures. | 1 2 |
Conv1d | A kernel for 1D convolution, often used in processing sequential data like time series or audio signals. | 1 |
Conv2d | A kernel for 2D convolution, a fundamental operation in computer vision tasks such as image classification or object detection. | 1 |
MultiheadAttention | A kernel for multi-head attention, a crucial component in transformer-based models for capturing complex relationships in data. | 1 |
Hardsigmoid | A kernel for the Hardsigmoid activation function, an efficient approximation of the sigmoid function used in certain neural network layers. | 1 |
GeLU | A kernel for the GeLU (Gaussian Error Linear Unit) activation function, widely used in transformer models. | 1 |
GeGLU | A kernel for GeGLU, a GeLU-gated linear unit used in the feed-forward blocks of some transformer variants. | 1 |
RMSNorm | A kernel for root mean square layer normalization, a simpler and faster alternative to LayerNorm used in models such as LLaMA. | 1 |
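As a companion to the Softmax row above, here is a minimal row-wise softmax sketch in the spirit of Triton's fused-softmax tutorial; it assumes each row fits in a single block (`BLOCK_SIZE >= n_cols`) and a contiguous 2D input:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, row_stride, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)  # one program per row
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    # Out-of-bounds lanes read -inf so they contribute nothing to max/sum.
    x = tl.load(in_ptr + row * row_stride + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)  # subtract the row max for numerical stability
    num = tl.exp(x)
    tl.store(out_ptr + row * row_stride + cols, num / tl.sum(num, axis=0), mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)  # tl.arange needs a power of 2
    softmax_kernel[(n_rows,)](out, x, x.stride(0), n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return out
```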
Feel free to contribute more resources or suggest updates by opening a pull request or issue in this repository.
This resource list is open-sourced under the MIT license.