Header image generated by DALL·E 3.
This is an extremely minimal but fast implementation of matrix multiplication in CUDA. The source code is a single 200-line file, gemm.cuh, which implements half-precision tensor-core matrix multiplication optimised for the Turing (SM75) architecture.
The implementation builds on top of CuTe from CUTLASS, a low-level interface for tensor manipulation in CUDA. The code is well-commented and is meant to be easily readable (minimal CUDA/C++ background knowledge required) and hackable.
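For a taste of what CuTe code looks like, here is a tiny host-side example (illustrative only, not taken from gemm.cuh) that builds a row-major layout with make_layout and views a raw array through it with make_tensor:

```cpp
#include <cute/tensor.hpp>

int main() {
  using namespace cute;

  // A row-major layout for an 8x4 tile: shape (8, 4), strides (4, 1).
  auto layout = make_layout(make_shape(Int<8>{}, Int<4>{}), GenRowMajor{});
  print(layout);  // prints the (shape):(stride) pair

  // View a raw array through that layout and index it as (row, col).
  float data[32] = {};
  auto tensor = make_tensor(&data[0], layout);
  tensor(1, 2) = 3.0f;

  return 0;
}
```

It compiles the same way as main.cu below; the only include path it needs is cutlass/include.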
Benchmarked against a standard reference implementation (see main.cu and reference.cu):
$ ./main
Usage: ./main M N K iters
$ ./main 4096 4096 4096 1000
Time elapse: 6043.59ms
TFLOPS: 22.7413
$ ./main 8192 8192 8192 100
Time elapse: 4819.51ms
TFLOPS: 22.8138
$ ./reference 4096 4096 4096 1000
Time elapse: 6040.42ms
TFLOPS: 22.7532
$ ./reference 8192 8192 8192 100
Time elapse: 4657.08ms
TFLOPS: 23.6095
The theoretical maximum for the hardware I used (RTX 2060) is 26 TFLOPS.
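As a back-of-the-envelope check on those figures (this is not the exact code in main.cu): a GEMM of size M×N×K performs 2·M·N·K floating-point operations per iteration, so the 4096³ run above works out to roughly 22.7 TFLOPS:

```cpp
#include <cstdio>

int main() {
  // Numbers from the 4096 x 4096 x 4096 run above.
  const double M = 4096, N = 4096, K = 4096, iters = 1000;
  const double elapsed_s = 6.04359;  // "Time elapse: 6043.59ms"

  // One multiply and one add per (m, n, k) triple.
  const double total_flops = 2.0 * M * N * K * iters;
  std::printf("TFLOPS: %.4f\n", total_flops / elapsed_s / 1e12);  // ~22.74
  return 0;
}
```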
Requires CUDA to be installed. Check out https://docs.nvidia.com/cuda/cuda-installation-guide-linux/ for instructions. If you don't have a compatible GPU, you can run this in Colab.
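If you're not sure whether your GPU supports the SM75 tensor-core path, a quick way to check (this helper is not part of the repo) is to query the device's compute capability; Turing is 7.5:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
    std::printf("No CUDA device found.\n");
    return 1;
  }
  // The kernel in this repo targets compute capability 7.5 (SM75).
  std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
  return 0;
}
```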
Compile the main.cu file:
mkdir -p build
nvcc \
--include-path ./ \
--include-path cutlass/include \
--generate-code=arch=compute_75,code=[compute_75,sm_75] \
--expt-relaxed-constexpr \
-forward-unknown-to-host-compiler \
-std=c++17 \
-O3 \
-o build/main \
main.cu
And run!
$ ./build/main
Usage: ./main M N K iters
$ ./build/main 4096 4096 4096 1000
Time elapse: 6043.59ms
TFLOPS: 22.7413
You can also build with CMake (a better option for development):
$ mkdir build
$ cd build/
$ cmake ..
-- Configuring done
-- Generating done
-- Build files have been written to: /workspaces/simpleGEMM/build
$ make main
Consolidate compiler generated dependencies of target main
[ 50%] Building CUDA object CMakeFiles/main.dir/main.cu.o
[100%] Linking CUDA executable main
[100%] Built target main
$ ./main
Usage: ./main M N K iters
The code trades off generality for simplicity:
- Only supports fp16 matmul out of the box. It should be quite easy to move to bf16, though.
- Optimised for SM75 w/ tensor cores. This is probably sub-optimal for SM80+ (e.g. A100), but probably not terrible either.
- Assumes (and asserts) that the input dimensions are divisible by the block size.
- Assumes the inputs are in row-major layout. (Though you probably only want to use a row-major layout anyway, as other combinations are 10-30% slower.)
- Doesn't do software pipelining (interleaving the global-memory load for the next tile with computation); see the sketch after this list.
- Is only optimal for "normal" problem sizes. For more exotic problem sizes like small-M/N with large-K, specialised implementations like a split-K kernel are likely to perform better.
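To give a feel for the software pipelining the kernel skips, here is a minimal sketch of the double-buffering pattern, with made-up names and a trivial per-tile sum standing in for the tensor-core inner loop; this is not code from gemm.cuh:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 256;  // one element per thread per tile (illustrative only)

// Double-buffered pattern: while the current tile is consumed from shared
// memory, the global-memory load for the next tile is already in flight.
__global__ void pipelined_sum(const float* in, float* out, int num_tiles) {
  __shared__ float smem[TILE];

  // Prologue: start the load for tile 0 into a register.
  float staged = in[threadIdx.x];
  float acc = 0.0f;

  for (int tile = 0; tile < num_tiles; ++tile) {
    // Commit the staged value to shared memory and make it visible.
    smem[threadIdx.x] = staged;
    __syncthreads();

    // Issue the load for the *next* tile before doing any math, so the
    // memory latency overlaps with the computation below.
    if (tile + 1 < num_tiles) {
      staged = in[(tile + 1) * TILE + threadIdx.x];
    }

    // "Compute" on the current tile (stand-in for the MMA inner loop).
    acc += smem[threadIdx.x];
    __syncthreads();
  }
  out[threadIdx.x] = acc;
}

int main() {
  const int num_tiles = 4;
  float *in, *out;
  cudaMallocManaged(&in, num_tiles * TILE * sizeof(float));
  cudaMallocManaged(&out, TILE * sizeof(float));
  for (int i = 0; i < num_tiles * TILE; ++i) in[i] = 1.0f;

  pipelined_sum<<<1, TILE>>>(in, out, num_tiles);
  cudaDeviceSynchronize();
  std::printf("out[0] = %.1f (expected %d)\n", out[0], num_tiles);

  cudaFree(in);
  cudaFree(out);
  return 0;
}
```

In a real GEMM kernel, the staged registers would hold the next A/B tile fragments and the compute step would be the tensor-core MMA over the tile currently in shared memory.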