Triton is a language for writing GPU kernels. It's easier to use than CUDA, and interoperates well with PyTorch.
If you want to speed up PyTorch training or inference, you can try writing Triton kernels for the heavier operations. (FlashAttention is a good example of a custom GPU kernel that speeds up training.)
This repo has my notes as I learn to use Triton. They include a lot of code, and some discussion of the key concepts. They're geared towards people new to GPU programming and Triton.
Hopefully you will find them useful.
- GPU Basics
- Vector Addition
- Matrix Multiplication
- Softmax forward and backward
- Block matmul
- Matmul forward and backward
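As a plain-CPU reference for the softmax topic above, here is what the forward and backward passes compute, sketched in NumPy (this is the math the Triton kernels implement, not code from the notebooks; the function names are my own):

```python
import numpy as np

def softmax(x):
    # subtract the row max for numerical stability before exponentiating
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_backward(y, grad_out):
    # y is the forward output; the Jacobian-vector product simplifies to
    # dx = y * (g - sum(g * y)), so the full Jacobian never needs materializing
    dot = (grad_out * y).sum(axis=-1, keepdims=True)
    return y * (grad_out - dot)
```

The backward simplification is the reason a fused softmax backward kernel only needs one reduction per row.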
To install Triton, run `pip install triton`. You need a CUDA-compatible GPU with the CUDA toolkit installed to use it.
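Once installed, a minimal Triton kernel looks like the one below: the canonical vector addition from the official Triton tutorials (also the subject of the Vector Addition notes). Each program instance handles one block of elements, with a mask guarding the ragged last block. Running it requires a CUDA GPU.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # each program instance handles one BLOCK_SIZE-long chunk of the vectors
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # mask out-of-bounds lanes so the last block doesn't read/write past the end
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    # host-side launcher: one program per BLOCK_SIZE elements
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Calling `add(a, b)` on two CUDA tensors should match `a + b`.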
The material in these notebooks draws on the following sources, which are generally good documentation: