This is a microbenchmark for timing Scatter/Gather kernels on CPUs and GPUs. View the source and read more about Spatter in our recently submitted paper. Please submit an issue on GitHub if you run into any problems.
For some time now, memory has been the bottleneck in modern computers. As CPUs grow more memory hungry due to increased clock speeds, an increased number of cores, and larger vector units, memory bandwidth and latency continue to stagnate. While increasingly complex cache hierarchies have helped ease this problem, they are best suited for regular memory accesses with large amounts of locality. However, there are many programs which do not display regular memory patterns and do not reuse data much, and thus do not benefit from such hierarchies. Irregular programs, which include many sparse matrix and graph algorithms, drive us to search for new approaches to better utilize what little memory bandwidth is available.
With this benchmark, we aim to characterize the performance of memory systems in a novel way. We want to be able to make comparisons across architectures about how well data can be rearranged, and we want to be able to use benchmark results to predict the runtimes of sparse algorithms on these various architectures. We will use these results to predict the impact of new memory access primitives.
Spatter supports the following primitives:
Scatter:
A[j[:]] = B[:]
Gather:
A[:] = B[i[:]]
S+G:
A[j[:]] = B[i[:]]
This diagram depicts the full Scatter+Gather. Gather corresponds to the top half of the diagram and Scatter to the bottom half.
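To make the index notation concrete, here is a minimal reference sketch of the three primitives as plain Python loops. The function names and the sample arrays are illustrative assumptions, not part of Spatter's code or command-line interface; the sketch only mirrors the semantics of the expressions listed above.

```python
# Reference semantics for the three primitives, written as plain loops.
# All names here are hypothetical and exist only for illustration.

def gather(src, idx):
    """A[:] = B[i[:]] : read src at the indexed positions into a dense list."""
    return [src[i] for i in idx]

def scatter(dst, src, idx):
    """A[j[:]] = B[:] : write the dense src into dst at the indexed positions."""
    for k, j in enumerate(idx):
        dst[j] = src[k]
    return dst

def scatter_gather(dst, src, gather_idx, scatter_idx):
    """A[j[:]] = B[i[:]] : read src at gather_idx, write dst at scatter_idx."""
    for i, j in zip(gather_idx, scatter_idx):
        dst[j] = src[i]
    return dst

B = [10, 11, 12, 13, 14, 15, 16, 17]
i = [0, 2, 4, 6]  # gather indices
j = [1, 3, 5, 7]  # scatter indices

gathered = gather(B, i)
scattered = scatter([0] * 8, [1, 2, 3, 4], j)
combined = scatter_gather([0] * 8, B, i, j)
print(gathered, scattered, combined)
```

Gather reads from scattered locations and writes densely, Scatter reads densely and writes to scattered locations, and S+G does both in one pass.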
CMake is required to build Spatter.
To build with CMake from the main source directory:
./configure/configure_ocl
cd build_ocl
make
or use one of the other configure scripts to compile with different backends.
The only required argument to spatter is the amount of data to move. It will guess all other arguments such as kernel and device. However, this produces data for a single sparsity (default is 1) and doesn't do any tuning. To obtain more useful output, continue on to the next section.
./spatter -l 2048
You can quickly compare one of your platforms to some of the GPUs we have tested on. We will add much more flexibility to this in the future, but for now, we will assume you are using CUDA.
You must have R installed to generate the plot.
Steps:
1. You will need the bandwidth of your GPU. If you don't know it, you can go to tests/run_babel_stream.sh and run it. The results will be in tests/BabelStream-3.3/babelstream_DEVICENAME_cuda.txt. Note the max copy bandwidth.
2. Go to your build folder (build_cuda) and run sparsity_test.sh. This will take a while. (But it will be optimized soon!)
3. Go to the quickstart directory (sibling of your build directory) and run ./gather_comparison.sh ../build_cuda/sg_sparse_roofline_cuda_user_GATHER.ssv BANDWIDTH, where BANDWIDTH is the bandwidth from step 1.
4. This will produce gather_comparison.eps in the quickstart directory. Your device will be called "USER", and will be colored orange.
Spatter has a large number of arguments. To start with, you should focus on -k (the kernel), -l (the length of the index arrays), -v (the work per thread) and -z (the CUDA/OpenCL block size).
./spatter <arguments>
-b, --backend=<backend>
Specify backend: OpenCL or OpenMP
-p, --cl-platform=<platform>
Specify platform if using OpenCL (case-insensitive, fuzzy matching)
-d, --cl-device=<device>
Specify device if using OpenCL (case-insensitive, fuzzy matching)
--interactive
Tell spatter you want to pick the platform and device interactively
-f, --kernel-file=<file>
Specify the location of a kernel file
-k, --kernel-name=<name>
Specify the name of the kernel (scatter, gather, or sg) you want to run
-v, --vector-len
Specifies the work per thread (poorly named, sorry)
-l, --generic-len
The number of elements to move. Automatically sets source-len, target-len, and index-len based on the kernel
-W, --workers
The number of OMP threads to use
-w, --wrap
More info coming soon
-s, --sparsity
Sparsity of source or target buffers
-z
GPU, OpenCL block size
-q
Suppress warnings
-nph, --no-print-header
Don't print the header on the output
--validate
Check the output of the kernel against naive CPU output
--source-len=<blocks>
The number of blocks that can be moved (default block size is 1)
--target-len=<blocks>
The number of blocks that can be filled
--index-len=<blocks>
The number of blocks that will be moved
--seed=<seed>
Optional: Specify random seed
--runs=<count>
Specify how many times to run the benchmark (default 10)
--loops=<count>
Specify how many scatters/gathers will be performed by a single run of the benchmark
Not yet implemented
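As a rough illustration of how the length, --seed, --runs, and --validate options relate to one another, the sketch below times a gather over several runs and checks each result against a naive loop. It is not Spatter's implementation: the function name, the random index generation, and the choice to report the fastest run are all assumptions made for this example.

```python
import random
import time

def run_gather_benchmark(length, runs=10, seed=0):
    """Hypothetical harness: time a gather `runs` times and validate each result."""
    rng = random.Random(seed)            # fixed seed -> reproducible index array
    src = list(range(length))
    idx = [rng.randrange(length) for _ in range(length)]

    # Naive reference result, used to validate every timed run.
    expected = []
    for i in idx:
        expected.append(src[i])

    best = float("inf")
    validated = True
    for _ in range(runs):
        start = time.perf_counter()
        out = [src[i] for i in idx]      # the "kernel" being timed
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)        # assumption: report the fastest run
        validated = validated and (out == expected)

    return {"best_seconds": best, "validated": validated, "indices": idx}

result = run_gather_benchmark(2048, runs=10, seed=42)
print(f"fastest run: {result['best_seconds']:.6f} s, validated: {result['validated']}")
```

Repeating the measurement and keeping one summary statistic reduces noise from a single run, and a fixed seed makes the access pattern reproducible between invocations.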