Matrix Multiplication using CUDA

Alexandru Barsan 2023

Machine Specifications

CPU: Intel i7-7700K
GPU: Nvidia Geforce GTX 1080 Ti
RAM: 16 GB DDR4, 1200 MHz DRAM frequency

Found 1 CUDA devices
id 0 b'NVIDIA GeForce GTX 1080 Ti' [SUPPORTED]
Compute Capability: 6.1
PCI Device ID: 0
PCI Bus ID: 1
UUID: GPU-d9ff490e-0dae-b9ff-d914-d72433dae551
Watchdog: Enabled
Compute Mode: WDDM
FP32/FP64 Performance Ratio: 32
Summary:
1/1 devices are supported
Device Name: NVIDIA GeForce GTX 1080 Ti
Maximum number of multiprocessors: 28
Maximum number of threads per block: 1024
Maximum number of blocks per grid: 1024
Maximum number of threads per grid: 2147483647
Maximum shared memory per block: 49152 bytes
Warp Size: 32
L2 Cache Size: 2883584 bytes
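
For reference, the summary above can be reproduced from Python: the "Found 1 CUDA devices" block is what numba's cuda.detect() prints, and the remaining figures come from per-device attribute queries. A rough sketch (attribute names are numba's standard CUDA device attributes; values will obviously differ per machine):

```python
# Sketch: reproduce the device summary above with numba's CUDA introspection.
from numba import cuda

cuda.detect()  # prints the "Found 1 CUDA devices ... [SUPPORTED]" block

dev = cuda.get_current_device()
print("Device Name:", dev.name)
print("Maximum number of multiprocessors:", dev.MULTIPROCESSOR_COUNT)
print("Maximum number of threads per block:", dev.MAX_THREADS_PER_BLOCK)
print("Maximum shared memory per block:", dev.MAX_SHARED_MEMORY_PER_BLOCK, "bytes")
print("Warp Size:", dev.WARP_SIZE)
print("L2 Cache Size:", dev.L2_CACHE_SIZE, "bytes")
```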

Prerequisites:

An NVIDIA CUDA-capable graphics card; the reference list can be found here: https://developer.nvidia.com/cuda-gpus

The NVIDIA CUDA Toolkit, which can be found here: https://developer.nvidia.com/cuda-toolkit

The oneMKL distribution, which can be found here: https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=window&distributions=online

The following Python libraries (installed either manually or via the environment import below).

This project switched over to Anaconda since apparently NumPy links against MKL natively when installed from Conda, but not when installed from pip.

Anaconda can be found here: https://www.anaconda.com/download

Manual library installation

To install the packages manually, run the following from an Administrator Anaconda prompt:

conda create --name GemmCUDA

conda activate GemmCUDA

conda install numpy

conda install numba

conda install -c nvidia cuda-toolkit

conda install -c nvidia cuda-python

conda install openpyxl

conda install libffi (pandas doesn't load this dependency properly on its own)

conda install cupy

Anaconda environment import

Create and activate the environment from environment.yml (this will take some time to install all the dependencies):

CuPy required downgrading to Python 3.9.18, and the switchover to Anaconda was also interesting, to say the least. Please import the environment using Anaconda and profit from my headaches:

conda env create -f environment.yml

conda activate GemmCUDA

Running with different matrix sizes:

From the Anaconda admin prompt, with the correct environment activated, navigate to the repo and run the following example command: python Gemm_CUDA.py 4096 4096 4096
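
The repo's Gemm_CUDA.py is the reference, but conceptually the entry point boils down to something like the sketch below: parse the three sizes from the command line (assumed here to map to M, N, K), fill two random matrices, and launch a naive numba CUDA kernel where each thread computes one element of C. Names and structure are illustrative, not a copy of the actual script.

```python
# Minimal sketch of a naive numba-CUDA GEMM driven by command-line sizes.
# Assumption: the three arguments map to M, N, K in C (M x N) = A (M x K) @ B (K x N).
import sys
import numpy as np
from numba import cuda

@cuda.jit
def naive_gemm_kernel(A, B, C):
    # One thread per output element of C.
    row, col = cuda.grid(2)
    if row < C.shape[0] and col < C.shape[1]:
        acc = 0.0
        for k in range(A.shape[1]):
            acc += A[row, k] * B[k, col]
        C[row, col] = acc

if __name__ == "__main__":
    M, N, K = (int(arg) for arg in sys.argv[1:4])
    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)

    d_A, d_B = cuda.to_device(A), cuda.to_device(B)
    d_C = cuda.device_array((M, N), dtype=np.float32)

    threads_per_block = (16, 16)
    blocks_per_grid = ((M + 15) // 16, (N + 15) // 16)
    naive_gemm_kernel[blocks_per_grid, threads_per_block](d_A, d_B, d_C)

    C = d_C.copy_to_host()
    print("matches numpy:", np.allclose(C, A @ B, rtol=1e-3))  # quick sanity check
```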

Profiling (most useful if running just one kernel)

nvprof python Gemm_CUDA.py 4096 4096 4096

Trials

  • All trials for Matrix Size use the naive non-JIT implementation of GEMM
  • All trials after 6 drop the naive non-JIT implementation for the sake of time and sanity (4096 was taking more than 20 hours...)
  • All trials compare against naive GEMM with numba JIT and loop re-ordered GEMM with numba JIT (see the CPU reference sketch after this list)
  • Threads Per Block affects all four GPU kernels; TILE_DIM affects only the shared-memory caching and vectorized kernels. Trials 8 to 23 compare only the GPU kernels
  • Profiling is done on some interesting kernels in trials 24 - 28
  • MKL and cuBLAS implementations are trials 30 - 36, for the same matrix sizes as 1-7; no other changes are needed since they are already optimised kernels (see the sketch after the table)
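
The two CPU references mentioned above are plain triple loops under numba's @njit; the only difference is the loop order. A hedged sketch (not the repo's exact code) of both:

```python
# CPU reference kernels: same arithmetic, different loop order.
from numba import njit

@njit(cache=True)
def gemm_naive_jit(A, B, C):
    # i-j-k ordering: the inner loop strides down a column of B,
    # which is cache-unfriendly for row-major arrays.
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            acc = 0.0
            for k in range(A.shape[1]):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc

@njit(cache=True)
def gemm_reordered_jit(A, B, C):
    # i-k-j ordering: the inner loop walks rows of B and C contiguously.
    # C must be zero-initialised before the call.
    for i in range(A.shape[0]):
        for k in range(A.shape[1]):
            a_ik = A[i, k]
            for j in range(B.shape[1]):
                C[i, j] += a_ik * B[k, j]
```

The reordering doesn't change the flop count, only the memory access pattern, which is where most of the CPU-side speedup comes from.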
| Trial # | Matrix Size | Threads Per Block | TILE_DIM | Comment |
|---|---|---|---|---|
| 1 | 32 | 16 | 16 | Done |
| 2 | 64 | 16 | 16 | Done |
| 3 | 128 | 16 | 16 | Done |
| 4 | 256 | 16 | 16 | Done |
| 5 | 512 | 16 | 16 | Done |
| 6 | 1024 | 16 | 16 | Done |
| 7 | 4096 | 16 | 16 | Done |
| 8 | 4096 | 4 | 4 | Done |
| 9 | 4096 | 4 | 8 | Done |
| 10 | 4096 | 4 | 16 | Done |
| 11 | 4096 | 4 | 32 | Done |
| 12 | 4096 | 8 | 4 | Done |
| 13 | 4096 | 8 | 8 | Done |
| 14 | 4096 | 8 | 16 | Done |
| 15 | 4096 | 8 | 32 | Done |
| x | 4096 | 16 | 4 | numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR for shared memory access kernel |
| x | 4096 | 16 | 8 | numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR for shared memory access kernel |
| 16 | 4096 | 16 | 16 | Done |
| 17 | 4096 | 16 | 32 | Done |
| x | 4096 | 32 | 4 | numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR for shared memory access kernel |
| x | 4096 | 32 | 8 | numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR for shared memory access kernel |
| x | 4096 | 32 | 16 | numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR for shared memory access kernel |
| 18 | 4096 | 32 | 32 | Done |
| 19 | 4096 | 16 | 4 | Works for vectorized kernel |
| 20 | 4096 | 16 | 8 | Works for vectorized kernel |
| 21 | 4096 | 32 | 4 | Works for vectorized kernel |
| 22 | 4096 | 32 | 8 | Works for vectorized kernel |
| 23 | 4096 | 32 | 16 | Works for vectorized kernel |
| 24 | 4096 | 4 | 32 | Same as Trial 11, SM kernel profiled |
| 25 | 4096 | 8 | 4 | Same as Trial 12, SM kernel profiled |
| 26 | 4096 | 8 | 32 | Same as Trial 15, SM kernel profiled |
| 27 | 4096 | 8 | 4 | Same as Trial 12, Vec kernel profiled |
| 28 | 4096 | 32 | 32 | Same as Trial 18, Vec kernel profiled |
| 30 | 32 | 16 | 16 | Re-run for MKL and cuBLAS implementations |
| 31 | 64 | 16 | 16 | Re-run for MKL and cuBLAS implementations |
| 32 | 128 | 16 | 16 | Re-run for MKL and cuBLAS implementations |
| 33 | 256 | 16 | 16 | Re-run for MKL and cuBLAS implementations |
| 34 | 512 | 16 | 16 | Re-run for MKL and cuBLAS implementations |
| 35 | 1024 | 16 | 16 | Re-run for MKL and cuBLAS implementations |
| 36 | 4096 | 16 | 16 | Re-run for MKL and cuBLAS implementations |
| 37 | 8192 | 16 | 16 | Re-run for MKL and cuBLAS implementations; compare with CUDA kernels for curiosity's sake |
| 38 | 16384 | 16 | 16 | Re-run for MKL and cuBLAS implementations; compare with CUDA kernels for curiosity's sake |
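
The MKL and cuBLAS trials need no custom kernel code: with conda's MKL-backed NumPy, a plain matmul already dispatches to MKL's sgemm, and CuPy's matmul on device arrays goes through cuBLAS. Roughly as below (a sketch; timing code omitted, sizes illustrative):

```python
# Sketch of the library baselines: MKL via NumPy, cuBLAS via CuPy.
import numpy as np
import cupy as cp

M = N = K = 4096
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)

# MKL path: conda's NumPy is linked against MKL, so this is an MKL sgemm.
C_mkl = A @ B

# cuBLAS path: CuPy dispatches float matmul on device arrays to cuBLAS.
d_A, d_B = cp.asarray(A), cp.asarray(B)
d_C = cp.matmul(d_A, d_B)
cp.cuda.Stream.null.synchronize()  # GPU work is async; wait before timing/copying
C_cublas = cp.asnumpy(d_C)
```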

Memory Access Violation

The in-depth call stack for the errors above is shown below. Using tile dimensions that don't match the number of threads per block causes memory access violations, probably due to misalignment of shared-memory accesses, which appears to be more forgiving at smaller threads-per-block and tile dimensions. The vectorized kernel's transpose forces alignment of shared memory and then just iterates through the working set based on TILE_DIM, which is probably why it avoids the error.

File "C:\Users\barsana\source\repos\GemmCUDA\Gemm_CUDA.py", line 49, in cuda_gemm C = C_global.copy_to_host() ^^^^^^^^^^^^^^^^^^^^^^^ File "C:\Users\barsana\AppData\Roaming\Python\Python311\site-packages\numba\cuda\cudadrv\devices.py", line 232, in _require_cuda_context return fn(*args, **kws) ^^^^^^^^^^^^^^^^ File "C:\Users\barsana\AppData\Roaming\Python\Python311\site-packages\numba\cuda\cudadrv\devicearray.py", line 277, in copy_to_host _driver.device_to_host(hostary, self, self.alloc_size, File "C:\Users\barsana\AppData\Roaming\Python\Python311\site-packages\numba\cuda\cudadrv\driver.py", line 3145, in device_to_host fn(host_pointer(dst), device_pointer(src), size, *varargs) File "C:\Users\barsana\AppData\Roaming\Python\Python311\site-packages\numba\cuda\cudadrv\driver.py", line 327, in safe_cuda_api_call self._check_ctypes_error(fname, retcode) File "C:\Users\barsana\AppData\Roaming\Python\Python311\site-packages\numba\cuda\cudadrv\driver.py", line 395, in _check_ctypes_error raise CudaAPIError(retcode, msg) numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR