Header image generated by DALL·E 3.
This is an extremely minimal but fast implementation of matrix multiplication in CUDA. The source code is a single 200-line file, gemm.cuh, which implements half-precision tensor-core matrix multiplication optimised for the Turing (SM75) architecture.
The implementation builds on top of CuTe from CUTLASS, a low-level interface for tensor manipulation in CUDA. The code is well-commented and is meant to be easily readable (minimal CUDA/C++ background knowledge required) and hackable.
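For a taste of what CuTe code looks like, here is a tiny host-side example (illustrative only, not taken from gemm.cuh) that builds a row-major layout with make_layout and views a raw array through it with make_tensor:

```cpp
#include <cute/tensor.hpp>

int main() {
  using namespace cute;

  // A row-major layout for an 8x4 tile: shape (8, 4), strides (4, 1).
  auto layout = make_layout(make_shape(Int<8>{}, Int<4>{}), GenRowMajor{});
  print(layout);  // prints the (shape):(stride) pair

  // View a raw array through that layout and index it as (row, col).
  float data[32] = {};
  auto tensor = make_tensor(&data[0], layout);
  tensor(1, 2) = 3.0f;

  return 0;
}
```

It compiles the same way as main.cu below; the only include path it needs is cutlass/include.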
Benchmarked against a standard reference implementation (see main.cu and reference.cu):
$ ./main
Usage: ./main M N K iters
$ ./main 4096 4096 4096 1000
Time elapse: 6043.59ms
TFLOPS: 22.7413
$ ./main 8192 8192 8192 100
Time elapse: 4819.51ms
TFLOPS: 22.8138
$ ./reference 4096 4096 4096 1000
Time elapse: 6040.42ms
TFLOPS: 22.7532
$ ./reference 8192 8192 8192 100
Time elapse: 4657.08ms
TFLOPS: 23.6095
The theoretical maximum for the hardware I used (RTX 2060) is 26 TFLOPS.
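As a back-of-the-envelope check on those figures (this is not the exact code in main.cu): a GEMM of size M×N×K performs 2·M·N·K floating-point operations per iteration, so the 4096³ run above works out to roughly 22.7 TFLOPS:

```cpp
#include <cstdio>

int main() {
  // Numbers from the 4096 x 4096 x 4096 run above.
  const double M = 4096, N = 4096, K = 4096, iters = 1000;
  const double elapsed_s = 6.04359;  // "Time elapse: 6043.59ms"

  // One multiply and one add per (m, n, k) triple.
  const double total_flops = 2.0 * M * N * K * iters;
  std::printf("TFLOPS: %.4f\n", total_flops / elapsed_s / 1e12);  // ~22.74
  return 0;
}
```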
Requires CUDA to be installed. Check out https://docs.nvidia.com/cuda/cuda-installation-guide-linux/ for instructions. If you don't have a compatible GPU, you can run this in Colab.
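If you're not sure whether your GPU supports the SM75 tensor-core path, a quick way to check (this helper is not part of the repo) is to query the device's compute capability; Turing is 7.5:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
    std::printf("No CUDA device found.\n");
    return 1;
  }
  // The kernel in this repo targets compute capability 7.5 (SM75).
  std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
  return 0;
}
```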
Compile the main.cu file:
mkdir -p build
nvcc \
--include-path ./ \
--include-path cutlass/include \
--generate-code=arch=compute_75,code=[compute_75,sm_75] \
--expt-relaxed-constexpr \
-forward-unknown-to-host-compiler \
-std=c++17 \
-O3 \
-o build/main \
main.cu
And run!
$ ./build/main
Usage: ./main M N K iters
$ ./build/main 4096 4096 4096 1000
Time elapse: 6043.59ms
TFLOPS: 22.7413
You can also build with CMake (a better option for development):
$ mkdir build
$ cd build/
$ cmake ..
-- Configuring done
-- Generating done
-- Build files have been written to: /workspaces/simpleGEMM/build
$ make main
Consolidate compiler generated dependencies of target main
[ 50%] Building CUDA object CMakeFiles/main.dir/main.cu.o
[100%] Linking CUDA executable main
[100%] Built target main
$ ./main
Usage: ./main M N K iters
The code trades off generality for simplicity:
- Only supports fp16 matmul out of the box. It should be quite easy to move to bf16, though.
- Optimised for SM75 w/ tensor cores. This is probably sub-optimal for SM80+ (e.g. A100), but probably not terrible either.
- Assumes (and asserts) that the input dimensions are divisible by the block size.
- Assumes the inputs are in row-major layout. (Though you probably only want to use a row-major layout anyway, as other combinations are 10-30% slower.)
- Doesn't do software pipelining (interleaving the global-memory load for the next tile with computation); see the sketch after this list.
- Is only optimal for "normal" problem sizes. For more exotic problem sizes like small-M/N with large-K, specialised implementations like a split-K kernel are likely to perform better.
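To give a feel for the software pipelining the kernel skips, here is a minimal sketch of the double-buffering pattern, with made-up names and a trivial per-tile sum standing in for the tensor-core inner loop; this is not code from gemm.cuh:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 256;  // one element per thread per tile (illustrative only)

// Double-buffered pattern: while the current tile is consumed from shared
// memory, the global-memory load for the next tile is already in flight.
__global__ void pipelined_sum(const float* in, float* out, int num_tiles) {
  __shared__ float smem[TILE];

  // Prologue: start the load for tile 0 into a register.
  float staged = in[threadIdx.x];
  float acc = 0.0f;

  for (int tile = 0; tile < num_tiles; ++tile) {
    // Commit the staged value to shared memory and make it visible.
    smem[threadIdx.x] = staged;
    __syncthreads();

    // Issue the load for the *next* tile before doing any math, so the
    // memory latency overlaps with the computation below.
    if (tile + 1 < num_tiles) {
      staged = in[(tile + 1) * TILE + threadIdx.x];
    }

    // "Compute" on the current tile (stand-in for the MMA inner loop).
    acc += smem[threadIdx.x];
    __syncthreads();
  }
  out[threadIdx.x] = acc;
}

int main() {
  const int num_tiles = 4;
  float *in, *out;
  cudaMallocManaged(&in, num_tiles * TILE * sizeof(float));
  cudaMallocManaged(&out, TILE * sizeof(float));
  for (int i = 0; i < num_tiles * TILE; ++i) in[i] = 1.0f;

  pipelined_sum<<<1, TILE>>>(in, out, num_tiles);
  cudaDeviceSynchronize();
  std::printf("out[0] = %.1f (expected %d)\n", out[0], num_tiles);

  cudaFree(in);
  cudaFree(out);
  return 0;
}
```

In a real GEMM kernel, the staged registers would hold the next A/B tile fragments and the compute step would be the tensor-core MMA over the tile currently in shared memory.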