Here you find a collection of CUDA-related material (books, papers, blog posts, YouTube videos, tweets, implementations, etc.). We also collect information on higher-level tools for performance optimization and kernel development, such as Triton and torch.compile()
... whatever makes the GPUs go brrrr.
You know a great resource we should add? Please see How to contribute.
- An Easy Introduction to CUDA C and C++
- CUDA Toolkit Documentation
- Basic terminology (thread block, warp, streaming multiprocessor): Wiki: Thread Block, A tour of CUDA
- GPU Performance Background User's Guide
- OLCF NVIDIA CUDA Training Series (talk recordings can be found under the presentation footer for each lecture); exercises
- GTC 2022 - CUDA: New Features and Beyond - Stephen Jones
- A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Anatomy of high-performance matrix multiplication
- Programming Massively Parallel Processors: A Hands-on Approach
- CUDA by Example: An Introduction to General-Purpose GPU Programming; code
- The CUDA Handbook
- Accelerating Generative AI with PyTorch: Segment Anything, Fast
- Accelerating Generative AI with PyTorch II: GPT, Fast
- PyTorch Compiler Troubleshooting
- PyTorch internals
- PyTorch 2 internals
- Triton compiler tutorials
- CUDA C++ Programming Guide
- pybind11 documentation
- NVIDIA Tensor Core Programming
- GPU Programming: When, Why and How?
- How CUDA Programming Works | GTC 2022
- Nsight Compute Profiling Guide
- mcarilli/nsight.sh - Favorite nsight systems profiling commands for PyTorch scripts
- Profiling GPU Applications with Nsight Systems
To share interesting CUDA-related links, please create a pull request for this file. See editing files in the GitHub documentation.
Or contact us on the CUDA MODE discord server: https://discord.gg/jqYdBWreqb