Benchmarking code for the "PopSparse: Accelerated block sparse matrix multiplication on IPU" paper.
We recommend starting with the PyTorch demo notebook.
To produce GPU timing numbers, make the following modifications to third-party benchmarks:
- Using CUDA 11.6.2 on a DGX A100.
- (BSR) Clone ceruleangu/Block-Sparse-Benchmark.
  - Replace `num_r_block -> (num_r_block - 1)`, `num_c_block -> (num_c_block - 1)` in `generate_candidate_blocks()`.
- (Dense) Clone hbrunie/cublas_benchmarks.
- (Optional) We recommend wrapping all CUDA calls in `CHECK_CUDA` macros to identify errors.
- Add `cudaDeviceSynchronize()` at the beginning of `GpuTimer::Start()`.
- Start `GpuTimer` after the first 5 runs.
- Stop after 20 timed runs, recording the total time using `cudaEventElapsedTime()` (see the sketch after this list).
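For reference, here is a minimal sketch of what these timing modifications look like. Note the assumptions: the `GpuTimer` interface below is a stand-in, not the class from the third-party benchmarks; the `CHECK_CUDA` definition is one common pattern, not the benchmarks' own; and `benchmarkedKernel` is a placeholder for the actual cuSPARSE/cuBLAS call being timed.

```cpp
// timing_sketch.cu -- illustrative only; compile with `nvcc timing_sketch.cu`
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One possible CHECK_CUDA definition: abort with file/line on any CUDA error.
#define CHECK_CUDA(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);     \
            std::exit(EXIT_FAILURE);                                        \
        }                                                                   \
    } while (0)

// Stand-in for the benchmarks' GpuTimer (interface assumed, not the original).
struct GpuTimer {
    cudaEvent_t startEvent{}, stopEvent{};
    GpuTimer() {
        CHECK_CUDA(cudaEventCreate(&startEvent));
        CHECK_CUDA(cudaEventCreate(&stopEvent));
    }
    ~GpuTimer() {
        cudaEventDestroy(startEvent);
        cudaEventDestroy(stopEvent);
    }
    void Start() {
        // The modification: drain all pending GPU work before recording the
        // start event, so warm-up runs are excluded from the timed region.
        CHECK_CUDA(cudaDeviceSynchronize());
        CHECK_CUDA(cudaEventRecord(startEvent));
    }
    void Stop() { CHECK_CUDA(cudaEventRecord(stopEvent)); }
    float ElapsedMs() {
        CHECK_CUDA(cudaEventSynchronize(stopEvent));
        float ms = 0.0f;
        CHECK_CUDA(cudaEventElapsedTime(&ms, startEvent, stopEvent));
        return ms;
    }
};

// Placeholder for the benchmarked kernel (a cuSPARSE BSR or cuBLAS call in practice).
__global__ void benchmarkedKernel() {}

int main() {
    GpuTimer timer;
    for (int i = 0; i < 5; ++i) benchmarkedKernel<<<1, 1>>>();   // 5 warm-up runs, untimed
    timer.Start();                                               // syncs, then records start
    for (int i = 0; i < 20; ++i) benchmarkedKernel<<<1, 1>>>();  // 20 timed runs
    timer.Stop();
    std::printf("total: %.3f ms over 20 runs\n", timer.ElapsedMs());
    return 0;
}
```

Timing all 20 runs with a single event pair amortizes event-recording overhead; `cudaEventElapsedTime()` returns the total in milliseconds, which can then be divided by the run count for a per-run figure. The `cudaDeviceSynchronize()` in `Start()` matters because kernel launches are asynchronous: without it, queued warm-up work would bleed into the timed interval.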
Copyright (c) 2023 Graphcore Ltd. Licensed under the MIT License.
The included code is released under an MIT license (see LICENSE).
Our dependencies are:
| Component | About | License |
| --- | --- | --- |
| cxxopts | CLI option parsing library (https://github.com/jarro2783/cxxopts) | MIT |
| matplotlib | Plotting library | BSD |
| numpy | Scientific computing with Python | BSD 3-Clause |