CUDA Examples
A small and extensible benchmark suite to implement and test algorithms in CUDA. Currently, two and a half algorithms are available:
vector_sum
: it computes the sum of a numerical vector of sizeN
, as presented here and here.matrix_multiplication
: it computes the dense matrix product of twoN x N
matrices, as shown here.ppr
: Personalized PageRank, computed over an undirected graph for random personalization vertices. The GPU implementation is missing, it's your task to create it!
Building the code
make
: it will create abin
folder and ab
executable inside it.- You need
nvcc
andcuBLAS
to be available. - Make sure to update the
Makefile
to match your GPU architecture (open the file for more information) - This has been tested on Linux and on Windows 11 using WSL2
Run the code
bin/b -d -c -n 1000 -b vec -I 1 -i 30 -t 32 -B 64
Warning: If running on Kepler GPUs (E.g. Nvidia K80), change branch with git checkout kepler
to run code compatible with this GPU. The only difference is that the VectorSym example uses single-precision floating-point arithmetic, as atomicAdd
on double-precision was not available in this architecture. Be aware of numerical issues when using large inputs and single-precision arithmetic!
Options
-n
: size of the benchmark, e.g. number of elements in a vector or number of rows/colums in a square matrix. Unused for Personalized PageRank-i
: number of repetitions of the test (30 by default)-b
: name of the benchmark (vec
,mmul
,ppr
)-d
: if true, print debug results. Else, print a CSV-like format. If you change the code, make sure that nothing but the CSV is printed when-d
is not present.-c
: if true, validate the correctness of the output. It might take a while on large inputs!-t
: number of threads in a GPU block. If using 2D or 3D blocks, this is the size of the dimension (i.e. if-t
is 8 and you use a 2D block, the block has 64 threads). The number of threads in a block cannot exceed 1024.-B
: number of blocks in the grid, for implementations where this setting is available. If using 2D or 3D blocks, this is the size of the dimension.-I
: implementation to use for a given benchmark. For example,vec
has 3 different implementations (0, 1, 2).-s
: warm-up iterations to skip when computing the average execution time (3 by default).-v
: if usingnvprof
, using-v
ensures that the profiler captures only the GPU computation and not the benchmark initialization or CPU validation, making profiling simpler.
Additional options for Personalized PageRank
-g
: path to the MTX file of the input graph (California
by default)-a
: value of alpha (damping factor), from 0 to 1 (0.85 by default)-m
: maximum number of iterations done in PPR (30 by default)-e
: convergence threshold that must be met to stop the computation before the maximum number of iterations has been reached (1e-6
by default)
Contest: creating the Personalized PageRank benchmark
If you want to implement the Personalized PageRank benchmark, you need to implement the missing function in the PageRank class. The graph loading is already there for you, but you need to to the following:
alloc
: allocate GPU memory or additional CPU memory.init
: initialize the PageRank computation (e.g. starting values of the PPR vector).reset
: reset the result and transfer data to the GPU.execute
: if you want to have multiple implementations, follow the pattern used in the other benchmarks. Otherwise, here you need to have the GPU host code responsible of starting and synchronizing the computation, and transferring the output from the GPU to the CPU.cpu_validation
: it computes the PPR computation on the CPU. GPU results must be moved to thepr
array, so that they can be checked.print_result
: used for debugging, you can print the output in whatever format you prefer.clean
: deallocate GPU memory or additional CPU memory.
Things to keep in mind:
- Make sure that the code can still be compiled with
make
, and no extra steps. - Make sure that the code prints the standard CSV with the result, and nothing else, when not using the debug flag.
- Create a run.sh script to execute the benchmark, with whatever input parameter (grid size, implementation etc.) you prefer