Step-by-step optimization of matrix multiplication, implemented in CUDA. For an explanation of each kernel, see siboehm.com/CUDA-MMM.
Running the kernels on an NVIDIA A6000 (Ampere):
GFLOPs at matrix size 4096x4096:
| Kernel | GFLOPs/s | Performance relative to cuBLAS |
|---|---|---|
| 1: Naive | 309.0 | 1.3% |
| 2: GMEM Coalescing | 1986.5 | 8.5% |
| 3: SMEM Caching | 2980.3 | 12.8% |
| 4: 1D Blocktiling | 8474.7 | 36.5% |
| 5: 2D Blocktiling | 15971.7 | 68.7% |
| 7: Avoid Bank Conflicts (Linearize) | 16213.4 | 69.7% |
| 8: Avoid Bank Conflicts (Offset) | 16459.2 | 70.8% |
| 11: Double Buffering | 17278.3 | 74.3% |
| 6: Vectorized Mem Access | 18237.3 | 78.4% |
| 9: Autotuning | 19721.0 | 84.8% |
| 10: Warptiling | 21779.3 | 93.7% |
| 0: cuBLAS | 23249.6 | 100.0% |
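For reference, below is a minimal sketch of what the naive starting point (kernel 1 in the table) looks like: one thread per output element, with every operand read straight from global memory. Kernel name, signature, and launch configuration here are illustrative, not the repository's exact code.

```cuda
// Illustrative naive SGEMM: C = alpha * (A @ B) + beta * C, one thread per C element.
// A is MxK, B is KxN, C is MxN, all row-major. Hypothetical sketch, not the repo's kernel.
__global__ void sgemm_naive(int M, int N, int K, float alpha, const float *A,
                            const float *B, float beta, float *C) {
  const int row = blockIdx.x * blockDim.x + threadIdx.x; // row of C owned by this thread
  const int col = blockIdx.y * blockDim.y + threadIdx.y; // column of C owned by this thread

  if (row < M && col < N) {
    float acc = 0.0f;
    // Dot product of one row of A with one column of B, directly from global memory.
    for (int k = 0; k < K; ++k) {
      acc += A[row * K + k] * B[k * N + col];
    }
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
  }
}
```

Launched with, say, 32x32 thread blocks over a ceil(M/32) x ceil(N/32) grid, this is the baseline that the later kernels (GMEM coalescing, SMEM caching, blocktiling, warptiling, ...) progressively improve on.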
- Install dependencies: CUDA toolkit 12, Python (+ Seaborn), CMake, Ninja. See environment.yml.
- Configure the NVCC compilation parameters: look up your GPU's compute capability on NVIDIA's CUDA GPUs page, then edit `CMakeLists.txt` and adjust `set(CUDA_COMPUTE_CAPABILITY 80)` accordingly.
- Build: `mkdir build && cd build && cmake .. && cmake --build .`
- Run one of the kernels: `DEVICE=<device_id> ./sgemm <kernel number>`
- Profile via NVIDIA Nsight Compute (ncu): `make profile KERNEL=<kernel number>`
Credit goes to wangzyon/NVIDIA_SGEMM_PRACTICE for the benchmarking setup.