Benchmarking code for the "PopSparse: Accelerated block sparse matrix multiplication on IPU" paper.
We recommend starting with the PyTorch demo notebook.
To produce GPU timing numbers, make the following modifications to third-party benchmarks:
- Using CUDA 11.6.2 on a DGX A100.
- (BSR) Clone ceruleangu/Block-Sparse-Benchmark.
  - Replace `num_r_block -> (num_r_block - 1)`, `num_c_block -> (num_c_block - 1)` in `generate_candidate_blocks()`.
- (Dense) Clone hbrunie/cublas_benchmarks.
- (Optional) We recommend wrapping all CUDA calls in `CHECK_CUDA` macros to identify errors.
- Add `cudaDeviceSynchronize()` at the beginning of `GpuTimer::Start()`.
- Start `GpuTimer` after the first 5 runs.
- Stop after 20 timed runs, recording the total time using `cudaEventElapsedTime()` (see the sketch after this list).
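For reference, here is a minimal sketch of what these timing modifications look like. Note the assumptions: the `GpuTimer` interface below is a stand-in, not the class from the third-party benchmarks; the `CHECK_CUDA` definition is one common pattern, not the benchmarks' own; and `benchmarkedKernel` is a placeholder for the actual cuSPARSE/cuBLAS call being timed.

```cpp
// timing_sketch.cu -- illustrative only; compile with `nvcc timing_sketch.cu`
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One possible CHECK_CUDA definition: abort with file/line on any CUDA error.
#define CHECK_CUDA(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);     \
            std::exit(EXIT_FAILURE);                                        \
        }                                                                   \
    } while (0)

// Stand-in for the benchmarks' GpuTimer (interface assumed, not the original).
struct GpuTimer {
    cudaEvent_t startEvent{}, stopEvent{};
    GpuTimer() {
        CHECK_CUDA(cudaEventCreate(&startEvent));
        CHECK_CUDA(cudaEventCreate(&stopEvent));
    }
    ~GpuTimer() {
        cudaEventDestroy(startEvent);
        cudaEventDestroy(stopEvent);
    }
    void Start() {
        // The modification: drain all pending GPU work before recording the
        // start event, so warm-up runs are excluded from the timed region.
        CHECK_CUDA(cudaDeviceSynchronize());
        CHECK_CUDA(cudaEventRecord(startEvent));
    }
    void Stop() { CHECK_CUDA(cudaEventRecord(stopEvent)); }
    float ElapsedMs() {
        CHECK_CUDA(cudaEventSynchronize(stopEvent));
        float ms = 0.0f;
        CHECK_CUDA(cudaEventElapsedTime(&ms, startEvent, stopEvent));
        return ms;
    }
};

// Placeholder for the benchmarked kernel (a cuSPARSE BSR or cuBLAS call in practice).
__global__ void benchmarkedKernel() {}

int main() {
    GpuTimer timer;
    for (int i = 0; i < 5; ++i) benchmarkedKernel<<<1, 1>>>();   // 5 warm-up runs, untimed
    timer.Start();                                               // syncs, then records start
    for (int i = 0; i < 20; ++i) benchmarkedKernel<<<1, 1>>>();  // 20 timed runs
    timer.Stop();
    std::printf("total: %.3f ms over 20 runs\n", timer.ElapsedMs());
    return 0;
}
```

Timing all 20 runs with a single event pair amortizes event-recording overhead; `cudaEventElapsedTime()` returns the total in milliseconds, which can then be divided by the run count for a per-run figure. The `cudaDeviceSynchronize()` in `Start()` matters because kernel launches are asynchronous: without it, queued warm-up work would bleed into the timed interval.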
Copyright (c) 2023 Graphcore Ltd. Licensed under the MIT License.
The included code is released under an MIT license (see LICENSE).
Our dependencies are:
| Component | About | License |
| --- | --- | --- |
| cxxopts | CLI option parsing library (https://github.com/jarro2783/cxxopts) | MIT |
| matplotlib | Plotting library | BSD |
| numpy | Scientific computing with Python | BSD 3-Clause |