Learn-CUDA

Learning how to write CUDA

Traps

THE COMPILER

There are about 1000000 (undocumented) flags that you have to pass to nvcc to make things work properly. If something doesn't compile / function properly, check the compiler flags first.

Flags that I missed:

  • -I / --include-path: Add a directory to the header search path. E.g. -I=src/cutlass/include,src/cutlass/tools/util/include.
  • -l / --library: Link against CUDA libraries like cuBLAS. E.g. -l=cublas,cublasLt. (CUTLASS is header-only, so it only needs -I.)
  • -gencode: Specify which device architecture you are compiling for. Needed to get things like Tensor Core GEMM working. E.g. -gencode=arch=compute_75,code=sm_75. (A full invocation combining these flags is sketched below.)
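
Putting these together, a full nvcc invocation might look roughly like this (file names src/main.cu and bin/main are placeholders, and the include paths assume CUTLASS is checked out under src/cutlass):

$ nvcc -O3 -std=c++17 \
    -I src/cutlass/include -I src/cutlass/tools/util/include \
    -gencode arch=compute_75,code=sm_75 \
    -lcublas -lcublasLt \
    -o bin/main src/main.cu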

Hardware support

The RTX 2060 (Turing, sm_75) only supports Tensor Core matmuls with FP16 inputs and an FP16 or FP32 compute (accumulator) dtype. Using FP32 for both the inputs and the compute dtype will fail.
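
A minimal WMMA sketch (not from this repo; the kernel name and sizes are made up) showing the combination that does work on sm_75: FP16 inputs with an FP32 accumulator. There is no FP32 input fragment type for Turing Tensor Cores, so the all-FP32 variant won't even compile.

#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;  // FP16 input
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;  // FP16 input
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;               // FP32 compute dtype

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

int main() {
    half *a, *b;
    float *c;
    cudaMallocManaged(&a, 16 * 16 * sizeof(half));
    cudaMallocManaged(&b, 16 * 16 * sizeof(half));
    cudaMallocManaged(&c, 16 * 16 * sizeof(float));
    for (int i = 0; i < 16 * 16; ++i) { a[i] = __float2half(1.0f); b[i] = __float2half(1.0f); }

    wmma_16x16x16<<<1, 32>>>(a, b, c);  // launch exactly one warp
    cudaDeviceSynchronize();
    std::printf("c[0] = %f (expect 16)\n", c[0]);
    return 0;
}

Compile with -gencode=arch=compute_75,code=sm_75 as above; without it, the WMMA intrinsics are not available.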

cuDNN find plan

cudnnFindPlan doesn't always pick the correct plan, which is quite annoying.

Discoveries

TFLOPS

The TFLOPS figures on canonical sources (e.g. https://www.techpowerup.com/gpu-specs) are not accurate estimates of the theoretical max FLOPS. They probably account for all types of operations (e.g. FP32, FP16, INT8, ...). We can actually reach much higher TFLOPS if we only use the Tensor Cores.

For example, the RTX 2060 MAX-Q is rated at a maximum of 9.101 TFLOPS (FP16). However,

$ bin/cutlassExample 8192 1000 tmp.txt
N = 8192, TIMES = 1000, FILENAME = tmp.txt
N = 8192, 0.0366 ops/ms, 40.1975 TFLOPS
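
Assuming one "op" here is a full 8192x8192 FP16 GEMM (2*N^3 FLOPs), 0.0366 ops/ms works out to 36.6 * 2 * 8192^3 ≈ 4.02e13 FLOPs/s, i.e. the ~40.2 TFLOPS reported.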

This number comes from the fact that the RTX 2060 has 30 SMs, each with 8 Tensor Cores (240 in total). Each Tensor Core can perform 64 fused multiply-adds (128 FLOPs) per clock cycle. While running the program, the clock speed was observed to be ~1300-1400 MHz. In summary, this gives 240 * 128 * 1350e6 / 1e12 = 41.472 TFLOPS!
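
A quick sanity check of that arithmetic, querying the device at runtime (the 8 Tensor Cores per SM and 64 FMAs per clock are Turing-specific assumptions, and clockRate reports the rated boost clock rather than the observed ~1350 MHz):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int tensor_cores_per_sm  = 8;    // Turing: 8 Tensor Cores per SM
    const int flops_per_tc_per_clk = 128;  // 64 FP16 FMAs = 128 FLOPs per clock

    // clockRate is in kHz; substitute the observed clock under load if desired.
    double clock_hz = prop.clockRate * 1e3;
    double tflops = (double)prop.multiProcessorCount * tensor_cores_per_sm
                    * flops_per_tc_per_clk * clock_hz / 1e12;

    std::printf("%s: %d SMs, peak FP16 Tensor Core throughput ~%.1f TFLOPS\n",
                prop.name, prop.multiProcessorCount, tflops);
    return 0;
}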