This repo provides implementations of common CUDA matrix operators and a corresponding suite of profiling programs, including:
Vector Addition

Version | Operator | Profiling Program
---|---|---
Manually managed memory | src/vector_addition.cu: vectorAdd | profiling/vector_addition/basic.cu
Unified memory-based interfaces | src/vector_addition.cu: vectorAdd | profiling/vector_addition/unified.cu
Unified memory-based interfaces with prefetching and memory hints | src/vector_addition.cu: vectorAdd | profiling/vector_addition/unified_prefetch.cu
cuBLAS | cublasSaxpy_v2 | profiling/cublas/vector_add.cu
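To illustrate the third variant (unified memory with prefetching and memory hints), here is a minimal sketch. It is not the repo's code: the kernel name `vectorAddSketch` and the helper `unifiedPrefetchVectorAdd` are hypothetical, and the calls assume the CUDA 11.x runtime signatures of `cudaMemAdvise` and `cudaMemPrefetchAsync` (later toolkits may expose different signatures).

```cuda
#include <cuda_runtime.h>

// Illustrative kernel (hypothetical name); the repo's vectorAdd may differ.
__global__ void vectorAddSketch(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: adds two n-element vectors held in unified memory.
// Returns true if allocation and execution succeeded and c[i] == a[i] + b[i].
bool unifiedPrefetchVectorAdd(int n) {
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return false;

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *a = nullptr, *b = nullptr, *c = nullptr;
    if (cudaMallocManaged(&a, bytes) != cudaSuccess ||
        cudaMallocManaged(&b, bytes) != cudaSuccess ||
        cudaMallocManaged(&c, bytes) != cudaSuccess) {
        cudaFree(a); cudaFree(b); cudaFree(c);  // cudaFree(nullptr) is a no-op
        return false;
    }

    // Initialize the inputs on the host; the pages start out CPU-resident.
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Hints and prefetching need concurrent managed access on the device.
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);
    if (concurrent) {
        // Memory hint: the inputs are only read by the kernel.
        cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, device);
        cudaMemAdvise(b, bytes, cudaMemAdviseSetReadMostly, device);
        // Move pages to the GPU up front instead of faulting them in on demand.
        cudaMemPrefetchAsync(a, bytes, device);
        cudaMemPrefetchAsync(b, bytes, device);
        cudaMemPrefetchAsync(c, bytes, device);
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAddSketch<<<blocks, threads>>>(a, b, c, n);

    // Bring the result back to the host before reading it there.
    if (concurrent) cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) ok = (c[i] == 3.0f);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return ok;
}
```

Calling `unifiedPrefetchVectorAdd(1 << 20)` from a `main` and profiling the binary should show fewer page-fault-driven migrations than the plain unified-memory variant, although the exact effect depends on the GPU and driver.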
Square Matrix Multiplication

Version | Operator | Profiling Program
---|---|---
Naive version of square matrix multiplication | src/matrix_mul.cu: squareMatrixMul | profiling/matrix_multiplication/basic.cu
Aligned memory access pattern through matrix transposition | src/matrix_mul.cu: alignedSquareMatrixMul | profiling/matrix_multiplication/aligned.cu
Tiled matrix multiplication using scratchpad (shared) memory | src/matrix_mul.cu: tiledSquareMatrixMul | profiling/matrix_multiplication/tiled.cu
cuBLAS | cublasSgemm_v2 | profiling/cublas/matrix_multiplication.cu
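As a rough illustration of the tiled variant, the sketch below stages `TILE x TILE` sub-blocks of both inputs in shared (scratchpad) memory so that each global-memory element is loaded once per tile rather than once per multiply. The names `tiledSquareMatMulSketch`, `tiledSquareMatMulHost`, and `TILE` are hypothetical, and the repo's `tiledSquareMatrixMul` may be organized differently.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // assumed tile width; must match the block dimensions

// Illustrative kernel: C = A * B for row-major n x n matrices.
__global__ void tiledSquareMatMulSketch(const float* A, const float* B, float* C, int n) {
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Each thread loads one element per tile; out-of-range entries become 0.
        tileA[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // tiles must be fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();  // finish reading before the next load overwrites the tiles
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host wrapper: returns C, or an empty vector on a CUDA error.
std::vector<float> tiledSquareMatMulHost(const std::vector<float>& A,
                                         const std::vector<float>& B, int n) {
    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    std::vector<float> C(static_cast<size_t>(n) * n);

    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess &&
              cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(TILE, TILE);
        dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
        tiledSquareMatMulSketch<<<grid, block>>>(dA, dB, dC, n);
        // The blocking copy also waits for the kernel to finish.
        ok = cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (!ok) C.clear();
    return C;
}
```

The bounds checks let `n` be any positive size, not only a multiple of `TILE`; a version restricted to multiples of the tile width could drop them.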
Sum Reduction

Version | Operator | Profiling Program
---|---|---
Naive implementation | src/sum_reduction.cu: sumReduction | profiling/sum_reduction/basic.cu
Implementation without warp divergence | src/sum_reduction.cu: nonDivergenceSumReduction | profiling/sum_reduction/non_divergence.cu
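One common way to reduce warp divergence in a block-level reduction is to replace the usual `tid % (2 * s) == 0` test with a strided index, so that the active threads are contiguous and whole warps are either busy or idle for most steps. The sketch below shows that pattern; it is an assumption about the technique, not a copy of the repo's `nonDivergenceSumReduction`, and the names `nonDivergentSumSketch`, `nonDivergentSumHost`, and `BLOCK` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 256;  // assumed block size; must be a power of two

// Illustrative kernel: each block writes the sum of its BLOCK inputs to out[blockIdx.x].
__global__ void nonDivergentSumSketch(const int* in, int* out, int n) {
    __shared__ int partial[BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < static_cast<unsigned int>(n)) ? in[i] : 0;
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        // Threads 0..(blockDim.x / (2*s) - 1) do the work, so active threads
        // stay packed into the lowest-numbered warps.
        unsigned int index = 2 * s * tid;
        if (index < blockDim.x) partial[index] += partial[index + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = partial[0];
}

// Hypothetical host wrapper: reduces per block on the GPU, then adds the
// per-block partial sums on the CPU. Returns false on a CUDA error.
bool nonDivergentSumHost(const std::vector<int>& data, long long* result) {
    *result = 0;
    int n = static_cast<int>(data.size());
    if (n == 0) return true;

    int blocks = (n + BLOCK - 1) / BLOCK;
    int *dIn = nullptr, *dOut = nullptr;
    std::vector<int> partials(blocks);

    bool ok = cudaMalloc(&dIn, n * sizeof(int)) == cudaSuccess &&
              cudaMalloc(&dOut, blocks * sizeof(int)) == cudaSuccess &&
              cudaMemcpy(dIn, data.data(), n * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        nonDivergentSumSketch<<<blocks, BLOCK>>>(dIn, dOut, n);
        // The blocking copy also waits for the kernel to finish.
        ok = cudaMemcpy(partials.data(), dOut, blocks * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn); cudaFree(dOut);
    if (ok) for (int p : partials) *result += p;
    return ok;
}
```

The strided index removes most divergence but introduces shared-memory bank conflicts; sequential addressing is a frequently used follow-up refinement.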
I also wrote accompanying blog posts (in Chinese) covering the under-the-hood details behind these profiling tests (available here). Comments and suggestions are welcome.
- A host equipped with an NVIDIA CUDA-capable GPU; see CUDA GPUs - NVIDIA Developer.
- An OS with the NVIDIA driver and CUDA Toolkit installed. To check:
# check driver status
$ nvidia-smi
Wed Jul 27 13:54:53 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
# Your GPU Info ...
# check cuda status
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
- An OS with the essential build tools installed:
# Ubuntu
sudo apt-get install build-essential
# CentOS
sudo yum install \
autoconf automake binutils \
bison flex gcc gcc-c++ gettext \
libtool make patch pkgconfig \
redhat-rpm-config rpm-build rpm-sign \
ctags elfutils indent patchutils
Use cmake to generate the Makefile for the operators and profiling programs:
# create subdirectory named "build"
mkdir build
# run cmake under [path to root]/build
cd build
cmake ..
Directories named bin and lib are created automatically under the root path. Then run make to build the final executables:
# run make under [path to root]/build
make
The profiling executables are then available under [path to root]/bin, and the operator library under [path to root]/lib.
Profiling requires NVIDIA Nsight Systems to be installed on both the server and the client side; see NVIDIA Nsight Systems.
cd [path to root]
nsys profile --force-overwrite true -o result/[target name] [path to root]/bin/[target name]
The profiling files under result were tested on an NVIDIA A4000.