A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.
This project is a step-by-step learning journey where we implement various types of Triton kernels—from the simplest examples to more advanced applications—while exploring GPU programming with Triton. The goal of this repository is to help you (and others) get comfortable with Triton by:
- Starting simple: begin with basic kernels such as vector addition, and understand the building blocks of writing GPU code with Triton (see the minimal sketch after this list).
- Incremental learning: each day introduces a new challenge, progressively covering more complex topics, techniques, and optimizations.
- Hands-on experience: code, test, and benchmark your kernels against standard implementations (e.g., PyTorch) to see performance improvements and better understand GPU behavior.
- Daily challenges: every day, a new challenge is posted in this repository. Each challenge focuses on a specific aspect of Triton, such as:
- Basic operations (e.g., vector addition)
- Memory management and optimizations
- Advanced indexing and dynamic shapes
- Multi-dimensional kernels
- Reduction operations and more
- Detailed explanations: each kernel comes with an in-depth explanation of the code, helping you understand the concepts behind the implementation.
- Benchmarking and stress tests: learn how to measure performance by comparing custom Triton kernels with standard PyTorch implementations. Get hands-on experience with benchmarking on real-world GPU workloads.
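To make this concrete, here is a minimal sketch of a Triton vector-addition kernel in the style of the official tutorials. The names and the block size are illustrative, not the repository's exact solutions:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)  # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail when n_elements % BLOCK_SIZE != 0
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)  # one program per block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

The mask is the key idea here: it lets a fixed-size block safely handle vectors whose length is not a multiple of `BLOCK_SIZE`.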
Day | Kernel | Description |
---|---|---|
#1 | Constant add | This challenge is the first puzzle in our Daily Triton Challenge series. The goal is to write a Triton kernel that adds a constant value to each element of a vector. |
#2 | Add two vectors | A simple example of adding two vectors with a custom GPU kernel written in Triton, comparing the result to a standard PyTorch implementation. |
#3 | Add two vectors with speed benchmarking | Almost the same as #2, but we also measure kernel execution speed and compare it against the PyTorch implementation (see the benchmarking sketch below this table). |
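For challenge #3-style measurements, Triton ships a small benchmarking helper, `triton.testing.do_bench`. A minimal sketch, assuming the `add` wrapper from the example above is in scope and a CUDA device is available:

```python
import torch
import triton

# `add` is assumed to be the Triton wrapper from the vector-addition sketch above.
x = torch.rand(1 << 20, device="cuda")
y = torch.rand(1 << 20, device="cuda")

# do_bench runs the callable repeatedly (with warmup) and reports time in ms.
ms_triton = triton.testing.do_bench(lambda: add(x, y))
ms_torch = triton.testing.do_bench(lambda: x + y)
print(f"Triton: {ms_triton:.3f} ms | PyTorch: {ms_torch:.3f} ms")
```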
Gain deeper insights into Triton through these detailed articles:
- Understanding the Triton Tutorials Part 1 and Part 2
- Softmax in OpenAI Triton: a more detailed, step-by-step explanation of the Fused Softmax Triton example
- Accelerating AI with Triton: A Deep Dive into Writing High-Performance GPU Code
- Accelerating Triton Dequantization Kernels for GPTQ
- Triton Tutorial #2
- Triton: OpenAI’s Innovative Programming Language for Custom Deep-Learning Primitives
- Triton Kernel Compilation Stages
- Deep Dive into Triton Internals Part 1, Part 2 and Part 3
- Exploring Triton GPU programming for neural networks in Java
- Using User-Defined Triton Kernels with torch.compile
- Mamba: The Hard Way
- FP8: Accelerating 2D Dynamic Block Quantized Float8 GEMMs in Triton
- FP8: Deep Dive on CUTLASS Ping-Pong GEMM Kernel
- FP8: Deep Dive on the Hopper TMA Unit for FP8 GEMMs
- Technical Review on PyTorch 2.0 and Triton
- Towards Agile Development of Efficient Deep Learning Operators
- Developing Triton Kernels on AMD GPUs
- CUDA-Free Inference for LLMs
Explore the academic foundation of Triton:
Learn by watching these informative videos:
- Lecture 14: Practitioners Guide to Triton and notebook
- Lecture 29: Triton Internals
- Intro to Triton: Coding Softmax in PyTorch
- Triton Vector Addition Kernel, part 1: Making the Shift to Parallel Programming
- Tiled Matrix Multiplication in Triton - part 1
- Flash Attention derived and coded from first principles with Triton (Python)
Watch the Triton community meetups to stay up to date with recent Triton topics.
Challenge yourself with these engaging puzzles:
Enhance your Triton development workflow with these tools:
- Triton Deja-vu: a framework that reduces the autotune overhead of triton-lang to zero for well-known deployments. This small framework is based on the Triton autotuner and contributes two features to the Triton community: (1) storing and safely restoring autotuner states using JSON files, and (2) ConfigSpaces for exploring a defined configuration space exhaustively. It also lets you combine heuristics with the autotuner (a sketch of the stock autotuner it builds on follows this list).
- Triton Profiler (Proton), plus a video explaining how to use it: Dev Tools: Proton/Interpreter
- Triton-Viz: A Visualization Toolkit for Programming with Triton
- Make Triton easier - Triton-util provides simple higher-level abstractions for frequent but repetitive tasks. This allows you to write code that is closer to how you actually think.
- TritonBench is a collection of PyTorch operators used to evaluate the performance of Triton and its integration with PyTorch.
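For context on what Triton Deja-vu caches, this is roughly what the stock Triton autotuner it builds on looks like (this is not Deja-vu's own API, and the configs are illustrative):

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 256}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=8),
    ],
    key=["n_elements"],  # re-tune whenever this argument's value changes
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# At launch time the autotuner picks BLOCK_SIZE, so it is not passed explicitly:
# grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
# add_kernel[grid](x, y, out, n)
```

The stock autotuner re-benchmarks every config each time a `key` value changes within a process; Deja-vu's JSON store/restore avoids repeating that work across runs.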
Catch up on the latest advancements from Triton Conferences:
Explore practical implementations with these sample kernels:
- attorch is a subset of PyTorch's nn module, written purely in Python using OpenAI's Triton
- FlagGems is a high-performance general operator library implemented in OpenAI Triton. It aims to provide a suite of kernel functions to accelerate LLM training and inference.
- Kernl lets you run Pytorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
- Liger-Kernel
- Triton Kernels for Efficient Low-Bit Matrix Multiplication
- Unsloth Kernels
- An attempt at implementing a Triton kernel for GPTQ inference. This code is based on the GPTQ-for-LLaMa codebase, which is itself based on the GPTQ codebase.
- triton-index: a catalog of openly available Triton kernels
- Triton-based implementation of Sparse Mixture-of-Experts (SMoE) on GPUs
- Variety of Triton and CUDA kernels for training and inference
- EquiTriton is a project that seeks to implement high-performance kernels for commonly used building blocks in equivariant neural networks, enabling compute-efficient training and inference
- Expanded collection of neural network activation functions and other kernels written in OpenAI's Triton
- Fused kernels
- A feed-forward implementation using only Triton activation kernels
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance
- Bitsandbytes: a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8- and 4-bit quantization functions
- MInference Triton Kernels - FlashAttention
Kernel | Description | Resource |
---|---|---|
VectorAdd | A simple kernel that performs element-wise addition of two vectors. Useful for understanding the basics of GPU programming in Triton. | 1 2 |
Matmul | An optimized kernel for matrix multiplication, achieving high performance by leveraging memory hierarchy and parallelism. | 1 2 Grouped GEMM |
Softmax | A kernel for efficient computation of the softmax function, commonly used in machine learning models like transformers (a minimal sketch follows this table). | 1 2 3 |
Dropout | A kernel for implementing low-memory dropout, a regularization technique to prevent overfitting in neural networks. | 1 2 |
Layer Normalization | A kernel for layer normalization, which normalizes activations within a layer to improve training stability in deep learning models. | 1 2 3 |
Fused Attention | A kernel that efficiently implements attention mechanisms by combining multiple operations, key to transformers and similar architectures. | 1 2 |
Conv1d | A kernel for 1D convolution, often used in processing sequential data like time series or audio signals. | 1 |
Conv2d | A kernel for 2D convolution, a fundamental operation in computer vision tasks such as image classification or object detection. | 1 |
MultiheadAttention | A kernel for multi-head attention, a crucial component in transformer-based models for capturing complex relationships in data. | 1 |
Hardsigmoid | A kernel for the Hardsigmoid activation function, an efficient approximation of the sigmoid function used in certain neural network layers. | 1 |
GeLU | A kernel for the GeLU (Gaussian Error Linear Unit) activation function, widely used in transformer models. | 1 |
GeGLU | A kernel for GeGLU, a GeLU-gated linear unit used in the feed-forward blocks of some transformer variants. | 1 |
RMSNorm | A kernel for root mean square layer normalization, a simpler and faster alternative to LayerNorm used in models such as LLaMA. | 1 |
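As a companion to the Softmax row above, here is a minimal row-wise softmax sketch in the spirit of Triton's fused-softmax tutorial; it assumes each row fits in a single block (`BLOCK_SIZE >= n_cols`) and a contiguous 2D input:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, row_stride, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)  # one program per row
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    # Out-of-bounds lanes read -inf so they contribute nothing to max/sum.
    x = tl.load(in_ptr + row * row_stride + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)  # subtract the row max for numerical stability
    num = tl.exp(x)
    tl.store(out_ptr + row * row_stride + cols, num / tl.sum(num, axis=0), mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)  # tl.arange needs a power of 2
    softmax_kernel[(n_rows,)](out, x, x.stride(0), n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return out
```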
Feel free to contribute more resources or suggest updates by opening a pull request or issue in this repository.
This resource list is open-sourced under the MIT license.