CUDAmop (CUDA Matrix Operators and Profiling)

This repo provides implementations of common CUDA matrix operators and a corresponding suite of profiling programs, including:

Vector Addition
Version	Operator	Profiling Program
Utilize manually managed memory	`src/vector_addition.cu` `vectorAdd`	`profiling/vector_addition/basic.cu`
Utilize unified memory-based interfaces	`src/vector_addition.cu` `vectorAdd`	`profiling/vector_addition/unified.cu`
Utilize unified memory-based interfaces with prefetching and memory hints	`src/vector_addition.cu` `vectorAdd`	`profiling/vector_addition/unified_prefetch.cu`
cuBLAS	`cublasSaxpy_v2`	`profiling/cublas/vector_add.cu`
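
To make the difference between the unified-memory variants concrete, here is a minimal sketch of vector addition with prefetching and memory hints. It is not the code in `src/vector_addition.cu`: the names `vectorAddSketch`, `unifiedPrefetchVectorAdd` and `CUDA_TRY` are hypothetical, the choice of hints (`cudaMemAdviseSetReadMostly` for inputs, `cudaMemAdviseSetPreferredLocation` for the output) is an assumption, and the calls use the CUDA 11.x/12.x signatures of `cudaMemAdvise` and `cudaMemPrefetchAsync`, which take an `int` device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Leave the enclosing function with -1 on any CUDA runtime error.
// (Sketch only: buffers are not released on the error path.)
#define CUDA_TRY(call)                                                    \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(err_)); \
            return -1;                                                    \
        }                                                                 \
    } while (0)

// c[i] = a[i] + b[i], one thread per element.
__global__ void vectorAddSketch(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Returns the number of wrong elements, or -1 if a CUDA call fails.
int unifiedPrefetchVectorAdd(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    int device = 0;
    CUDA_TRY(cudaGetDevice(&device));

    // Hints and prefetches require concurrent managed access; skip them otherwise.
    int concurrent = 0;
    CUDA_TRY(cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device));

    float *a = nullptr, *b = nullptr, *c = nullptr;
    CUDA_TRY(cudaMallocManaged(&a, bytes));
    CUDA_TRY(cudaMallocManaged(&b, bytes));
    CUDA_TRY(cudaMallocManaged(&c, bytes));

    // Initialize on the host; the pages start out resident in CPU memory.
    for (int i = 0; i < n; ++i) { a[i] = 1.0f * i; b[i] = 2.0f * i; }

    if (concurrent) {
        // Assumed hints: inputs are only read by the kernel, the output lives on the GPU.
        CUDA_TRY(cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, device));
        CUDA_TRY(cudaMemAdvise(b, bytes, cudaMemAdviseSetReadMostly, device));
        CUDA_TRY(cudaMemAdvise(c, bytes, cudaMemAdviseSetPreferredLocation, device));
        // Migrate pages before the launch instead of faulting them in on demand.
        CUDA_TRY(cudaMemPrefetchAsync(a, bytes, device, 0));
        CUDA_TRY(cudaMemPrefetchAsync(b, bytes, device, 0));
        CUDA_TRY(cudaMemPrefetchAsync(c, bytes, device, 0));
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vectorAddSketch<<<blocks, threads>>>(a, b, c, n);
    CUDA_TRY(cudaGetLastError());

    // Bring the result back to host memory ahead of the CPU-side check.
    if (concurrent) CUDA_TRY(cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId, 0));
    CUDA_TRY(cudaDeviceSynchronize());

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (c[i] != 3.0f * i) ++mismatches;  // exact for the index range used here

    cudaFree(a); cudaFree(b); cudaFree(c);
    return mismatches;
}

int main() {
    int bad = unifiedPrefetchVectorAdd(1 << 20);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Dropping the `if (concurrent)` blocks leaves the plain unified-memory variant, where pages migrate on first touch; comparing the two timelines in Nsight Systems should show where the page-fault traffic goes.
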
Square Matrix Multiplication
Version	Operator	Profiling Program
Naive version of square matrix multiplication	`src/matrix_mul.cu` `squareMatrixMul`	`profiling/matrix_multiplication/basic.cu`
Align memory access pattern through matrix transposition	`src/matrix_mul.cu` `alignedSquareMatrixMul`	`profiling/matrix_multiplication/aligned.cu`
Utilize scratchpad memory for tiled matrix multiplication	`src/matrix_mul.cu` `tiledSquareMatrixMul`	`profiling/matrix_multiplication/tiled.cu`
cuBLAS	`cublasSgemm_v2`	`profiling/cublas/matrix_multiplication.cu`
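
As an illustration of the tiled variant, the following is a minimal sketch of square matrix multiplication that stages tiles in shared (scratchpad) memory. It is not the repo's `tiledSquareMatrixMul`: the names `tiledSquareMatMulSketch` and `tiledMatMulMaxError`, the tile size of 16, and the row-major layout are assumptions made for this example.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile edge; blocks are TILE x TILE threads

// C = A * B for row-major n x n matrices. Each block computes one TILE x TILE
// tile of C, loading the matching tiles of A and B into shared memory first.
__global__ void tiledSquareMatMulSketch(const float *A, const float *B, float *C, int n) {
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range elements are padded with zeros so n need not divide by TILE.
        tileA[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // both tiles fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k)
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load overwrites
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Returns the max absolute difference against a CPU reference, or -1 on a CUDA error.
float tiledMatMulMaxError(int n) {
    const size_t count = static_cast<size_t>(n) * n;
    const size_t bytes = count * sizeof(float);
    std::vector<float> hA(count), hB(count), hC(count);
    // Small integer values keep every partial sum exactly representable in float.
    for (size_t i = 0; i < count; ++i) { hA[i] = float(i % 7); hB[i] = float(i % 5); }

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess &&
              cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(TILE, TILE);
        dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
        tiledSquareMatMulSketch<<<grid, block>>>(dA, dB, dC, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);  // cudaFree(nullptr) is a no-op
    if (!ok) return -1.0f;

    float maxErr = 0.0f;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            maxErr = std::fmax(maxErr, std::fabs(ref - hC[r * n + c]));
        }
    return maxErr;
}

int main() {
    float err = tiledMatMulMaxError(100);  // 100 is deliberately not a multiple of TILE
    printf("max abs error: %f\n", err);
    return err == 0.0f ? 0 : 1;
}
```

Each element of A and B is read from global memory once per tile rather than once per output element, which is the reuse the tiled version is meant to exploit.
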
Sum Reduction
Version	Operator	Profiling Program
Naive implementation	`src/sum_reduction.cu` `sumReduction`	`profiling/sum_reduction/basic.cu`
Implementation without warp divergence	`src/sum_reduction.cu` `nonDivergenceSumReduction`	`profiling/sum_reduction/non_divergence.cu`
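
The contrast between the two reduction variants can be sketched as follows. This is a hedged illustration rather than the code in `src/sum_reduction.cu`: the kernel names `naiveSumSketch` and `nonDivergentSumSketch`, the helpers `expectedSum` and `reduceWith`, the `int` element type, and the strided-index scheme used to remove divergence are all assumptions. Each block reduces its slice in shared memory and writes one partial sum; the host adds the partials.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block; must be a power of two

// Naive: a thread is active when tid % (2 * s) == 0, so active and idle
// threads are interleaved inside every warp and each warp diverges.
__global__ void naiveSumSketch(const int *in, int *out, int n) {
    __shared__ int part[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    part[tid] = (i < n) ? in[i] : 0;
    __syncthreads();
    for (int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) part[tid] += part[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = part[0];
}

// Same additions, but consecutive thread ids are mapped onto the strided
// elements. Active threads are contiguous, so whole warps are either busy or
// idle until fewer than 32 threads remain active.
__global__ void nonDivergentSumSketch(const int *in, int *out, int n) {
    __shared__ int part[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    part[tid] = (i < n) ? in[i] : 0;
    __syncthreads();
    for (int s = 1; s < blockDim.x; s *= 2) {
        int idx = 2 * s * tid;
        if (idx < blockDim.x) part[idx] += part[idx + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = part[0];
}

using ReduceKernel = void (*)(const int *, int *, int);

// CPU reference for the hypothetical test input in[i] = i % 10.
long long expectedSum(int n) {
    long long s = 0;
    for (int i = 0; i < n; ++i) s += i % 10;
    return s;
}

// Runs one kernel over n elements; returns the total, or -1 on a CUDA error.
long long reduceWith(ReduceKernel kernel, int n) {
    std::vector<int> hIn(n);
    for (int i = 0; i < n; ++i) hIn[i] = i % 10;
    const int blocks = (n + BLOCK - 1) / BLOCK;
    std::vector<int> hOut(blocks);

    int *dIn = nullptr, *dOut = nullptr;
    bool ok = cudaMalloc(&dIn, n * sizeof(int)) == cudaSuccess &&
              cudaMalloc(&dOut, blocks * sizeof(int)) == cudaSuccess &&
              cudaMemcpy(dIn, hIn.data(), n * sizeof(int), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        kernel<<<blocks, BLOCK>>>(dIn, dOut, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hOut.data(), dOut, blocks * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn); cudaFree(dOut);  // cudaFree(nullptr) is a no-op
    if (!ok) return -1;

    long long total = 0;
    for (int v : hOut) total += v;  // final pass over per-block partial sums
    return total;
}

int main() {
    const int n = 1 << 20;
    long long naive = reduceWith(naiveSumSketch, n);
    long long nonDiv = reduceWith(nonDivergentSumSketch, n);
    printf("naive=%lld non-divergent=%lld expected=%lld\n", naive, nonDiv, expectedSum(n));
    return (naive == expectedSum(n) && nonDiv == expectedSum(n)) ? 0 : 1;
}
```

Both kernels perform the same number of additions; the second only changes which threads do them, so any timing difference seen when profiling comes from warp scheduling rather than from less arithmetic.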

I also wrote accompanying blog posts (in Chinese) covering the under-the-hood details behind these profiling tests (available here). Feel free to read them and leave a comment if you have any suggestions.

Build Project

Preparation

A host equipped with an NVIDIA CUDA-capable GPU (see CUDA GPUs - NVIDIA Developer);
An OS with the NVIDIA driver and CUDA Toolkit installed. To check:

# check driver status
$ nvidia-smi
Wed Jul 27 13:54:53 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
# Your GPU Info ...

# check cuda status
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

An OS with the essential build tools installed:

# Ubuntu
sudo apt-get install build-essential

# CentOS
sudo yum install \
        autoconf automake binutils \
        bison flex gcc gcc-c++ gettext \
        libtool make patch pkgconfig \
        redhat-rpm-config rpm-build rpm-sign \
        ctags elfutils indent patchutils

Build Project

Use CMake to generate the Makefile for the operators and profiling programs:

# create subdirectory named "build"
mkdir build

# run cmake under [path to root]/build
cd build
cmake ..

Directories named bin and lib are created automatically under the root path. Then run make to build the final executables:

# run make under [path to root]/build
make

The profiling executables are then available under [path to root]/bin, and the operator library under [path to root]/lib.

Profiling

Preparation

NVIDIA Nsight Systems must be installed on both the server side and the client side (see NVIDIA Nsight Systems).

Usage

cd [path to root]
nsys profile --force-overwrite true -o result/[target name] [path to root]/bin/[target name]

The profiling files under result were collected on an NVIDIA A4000.