Here you find a collection of CUDA-related material (books, papers, blog posts, YouTube videos, tweets, implementations etc.). We also collect information on higher-level tools for performance optimization and kernel development like Triton and torch.compile()
... whatever makes the GPUs go brrrr.
You know a great resource we should add? Please see How to contribute.
- Lecture 1: How to profile CUDA in PyTorch, recording: Video, code: lecture1, speaker: Mark Saroufim
- Lecture 2: Recap of Chapters 1-3 from the PMPP book, recording: Video, code: lecture2, speaker: Andreas Koepf
- Lecture 3: Getting Started With CUDA for Python Programmers, recording: Video, code: lecture3, speaker: Jeremy Howard
- Lecture 4: Intro to Compute and Memory Architecture, recording: Video, notebook: lecture4, speaker: Thomas Viehmann
- Lecture 5: Going Further with CUDA for Python Programmers, recording: Video, notebook: lecture5, speaker: Jeremy Howard
- Lecture 6: Optimizing Optimizers, recording: Video
- Lecture 7: Advanced Quantization, recording: Video, speaker: Charles Hernandez
- Lecture 8: CUDA Performance Checklist, recording: Video, code: lecture8, speaker: Mark Saroufim
- Lecture 9: Reductions, recording: Video, slides, code: lecture9, speaker: Mark Saroufim
- Lecture 10: Build a Prod Ready CUDA Library, recording: Video, slides, speaker: Oscar Amoros Huguet
- An Easy Introduction to CUDA C and C++
- An Even Easier Introduction to CUDA
- CUDA Toolkit Documentation
- Basic terminology: Thread block, Warp, Streaming Multiprocessor: Wiki: Thread Block, A tour of CUDA
- GPU Performance Background User's Guide
- OLCF NVIDIA CUDA Training Series, talk recordings can be found under the presentation footer for each lecture; exercises
- GTC 2022 - CUDA: New Features and Beyond - Stephen Jones
- Intro video: Writing Code That Runs FAST on a GPU
- A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Anatomy of high-performance matrix multiplication
- Programming Massively Parallel Processors: A Hands-on Approach
- CUDA by Example: An Introduction to General-Purpose GPU Programming; code
- The CUDA Handbook
- The Book of Shaders: a guide through the abstract and complex universe of fragment shaders (not CUDA, but GPU-related)
- Art of HPC: four books on HPC more generally; they do not specifically cover GPUs, but the lessons broadly apply
- HetSys: Programming Heterogeneous Computing Systems with GPUs and other Accelerators
- Heterogeneous Parallel Programming Class (YouTube playlist), Prof. Wen-mei Hwu, University of Illinois
- Official YouTube channel for "Programming Massively Parallel Processors: A Hands-on Approach", course playlist: Applied Parallel Programming
- Programming Parallel Computers; covers both CUDA and CPU parallelism. Use the Open Course version and you can even submit your own solutions to the exercises for testing and benchmarking.
- x: @tri_dao, gh: tridao
- Dao-AILab/flash-attention, paper
- state-spaces/mamba, paper: Mamba: Linear-Time Sequence Modeling with Selective State Spaces, minimal impl: mamba-minimal
- x: @Tim_Dettmers, gh: TimDettmers
- TimDettmers/bitsandbytes, docs: docs
- QLoRA: Efficient Finetuning of Quantized LLMs
- x: @srush_nlp, gh: srush
- Sasha Rush's GPU Puzzles, dshah3's CUDA C++ version & walkthrough video
- Mamba: The Hard Way, code: srush/annotated-mamba
- Accelerating Generative AI with PyTorch: Segment Anything, Fast
- Accelerating Generative AI with PyTorch II: GPT, Fast
- Speed, Python: Pick Two. How CUDA Graphs Enable Fast Python Code for Deep Learning
- Performance Debugging of Production PyTorch Models at Meta
- TorchDynamo Deep Dive
- PyTorch Compiler Troubleshooting
- PyTorch internals
- PyTorch 2 internals
- Understanding GPU memory: 1: Visualizing All Allocations over Time, 2: Finding and Removing Reference Cycles
- Debugging memory using snapshots: Debugging PyTorch memory use with snapshots
- CUDA caching allocator: https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html
- Trace Analyzer: PyTorch Trace Analysis for the Masses
- Triton compiler tutorials
- CUDA C++ Programming Guide
- PyTorch: Custom C++ and CUDA Extensions, Code: pytorch/extension-cpp
- PyTorch C++ API
- pybind11 documentation
- NVIDIA Tensor Core Programming
- GPU Programming: When, Why and How?
- How GPU Computing Works | GTC 2021 (more basic than the 2022 version)
- How CUDA Programming Works | GTC 2022
- CUDA Kernel optimization: Part 1, Part 2
- PTX and ISA Programming Guide (V8.3)
- Compiler Explorer: Inspect PTX: div 256 -> shr 8 example
- Nsight Compute Profiling Guide
- mcarilli/nsight.sh - Favorite nsight systems profiling commands for PyTorch scripts
- Profiling GPU Applications with Nsight Systems
- PyTorch
- Triton, github: openai/triton
- numba @cuda.jit
- Apache TVM
- JAX Pallas
- CuPy: NumPy-compatible GPU computing
- NVIDIA Fuser
- Codon @gpu.kernel, github: exaloop/codon
- Mojo (part of the commercial MAX Platform by Modular)
- NVIDIA Python Bindings: CUDA Python (calling NVRTC to compile kernels, malloc, copy, launching kernels, ...), cuDNN FrontEnd (FE) API, CUTLASS Python Interface
- TACO: The Tensor Algebra Compiler, gh: tensor-compiler/taco
- Mosaic compiler: C++ DSL for sparse and dense tensor algebra (built on top of TACO), paper, presentation
- Cooperative Groups: Flexible CUDA Thread Programming (Oct 04, 2017)
- A friendly introduction to machine learning compilers and optimizers (Sep 7, 2021)
- NVIDIA H100 Whitepaper
- NVIDIA GH200 Whitepaper
- AMD CDNA 3 Whitepaper
- AMD MI300X Data Sheet
- Video: Can SRAM Keep Shrinking? (by Asianometry)
- see our ring-attention repo
- GPU Gems: Parallel Prefix Sum (Scan) with CUDA, PDF version (2007), impl: stack overflow, nicer impl: mattdean1/cuda
- Accelerating Reduction and Scan Using Tensor Core Units
- Thrust: Prefix Sums, Reference: scan variants
- CUB, part of cccl: NVIDIA/cccl/tree/main/cub
- SAM Algorithm: Higher-Order and Tuple-Based Massively-Parallel Prefix Sums (licensed for non-commercial use only)
- CUB Algorithm: Single-pass Parallel Prefix Scan with Decoupled Look-back
- Group Experiments: johnryan465/pscan, andreaskoepf/pscan_kernel
- unsloth, which implements custom kernels in Triton for faster QLoRA training
- Custom implementation of relative position attention (link)
- Tri Dao's Triton implementation of Flash Attention: flash_attn_triton.py
- YouTube playlist: Triton Conference 2023
To share interesting CUDA-related links, please create a pull request for this file. See editing files in the GitHub documentation.
Or contact us on the CUDA MODE discord server: https://discord.gg/cudamode