This repository releases code for our paper SBNet: Sparse Blocks Network for Fast Inference. Please refer to our blog post for more context. Note that benchmarking in the paper was performed with an older version of this repo using TensorFlow 1.2, cuDNN 6.1 and commit cf8ea06.
This repository contains
- a TensorFlow custom operations library that implements SBNet,
- a Python implementation of sparse ResNet blocks, and
- a benchmark for performance comparison with Submanifold Sparse Convolutional Networks.
Installation was tested under Ubuntu 14.04 and 16.04 with TensorFlow 1.8, CUDA 9.0 and cuDNN 7.1.
Code was tested on and compiled for NVIDIA CUDA compute capabilities 6.1, 6.0, 5.2 and 7.0 (Titan XP, GTX 1080Ti, GTX 1080, P100, V100, TitanV, and most Maxwell cards).
To compile for an older architecture, modify the Makefile and add the corresponding line, such as -gencode arch=compute_50,code=sm_50 for older cards such as laptop Maxwell.
Please refer to the CUDA Wikipedia page to look up the architecture code for your graphics card.
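As a small convenience, the flag format shown above can be generated from a compute capability string. This is only a sketch: `gencode_flag` is a hypothetical helper that is not part of this repository, and the resulting line still has to be added to the Makefile by hand.

```python
def gencode_flag(compute_capability):
    """Format an nvcc -gencode flag for a compute capability such as "5.0"."""
    major, minor = compute_capability.split(".")
    code = f"{int(major)}{int(minor)}"
    return f"-gencode arch=compute_{code},code=sm_{code}"


print(gencode_flag("5.0"))  # -gencode arch=compute_50,code=sm_50
```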
To build a release version of the library, run
cd sbnet_tensorflow/sbnet_ops && make
To run tests:
cd sbnet_tensorflow/sbnet_ops && make test
The library will be built in sbnet_tensorflow/sbnet_ops/build/libsbnet.so and symlinked to sbnet_tensorflow/sbnet_ops/libsbnet.so. To import the library into your TensorFlow Python code use the following command:
sbnet_module = tf.load_op_library('path_to_library/libsbnet.so')
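A slightly more defensive way to load the library is sketched below. It assumes the repository was checked out at `repo_root` and built with `make`, so that the symlink described above exists. The helper names `sbnet_library_path` and `load_sbnet` are hypothetical and not part of this repository; only `tf.load_op_library` is a TensorFlow call.

```python
import os


def sbnet_library_path(repo_root):
    """Return the path of the symlinked library inside a checkout at repo_root."""
    path = os.path.join(repo_root, "sbnet_tensorflow", "sbnet_ops", "libsbnet.so")
    if not os.path.exists(path):
        raise FileNotFoundError(
            "libsbnet.so not found at %s; run make in "
            "sbnet_tensorflow/sbnet_ops first" % path)
    return path


def load_sbnet(repo_root):
    """Load the op library after checking that it has been built."""
    # TensorFlow is imported lazily so the path check works without it.
    import tensorflow as tf
    return tf.load_op_library(sbnet_library_path(repo_root))
```

Checking the path first gives a clearer error than the one TensorFlow reports when the shared object is missing.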
The following TensorFlow ops are implemented in the op library:
sbnet_module.reduce_mask
sbnet_module.sparse_gather
sbnet_module.sparse_scatter
The reduce_mask op converts a dense mask to a list of active block indices.
In the following snippet the mask is expected to be a tensor of dimensions [N,H,W,1]:
indices = sbnet_module.reduce_mask(
    mask, tf.constant([BCH, BCW], dtype=tf.int32),
    bsize=[BSZH, BSZW],
    boffset=[BOFFSH, BOFFSW],
    bstride=[BSTRH, BSTRW],
    tol=0.5,  # pooling threshold to consider a block as active
    avgpool=True)  # max pooling by default
[BCH, BCW] are block counts in height and width dimensions.
[BSZH, BSZW], [BOFFSH, BOFFSW] and [BSTRH, BSTRW] are block sizes, offsets and strides in H and W dimensions.
reduce_mask performs a max pooling (or average pooling) operation localized to each block, then generates a list of index triples [(ni, hi, wi)] for the blocks whose pooled value exceeds the specified tolerance tol.
In numpy terms each block is defined as a slice from the input mask of dimensions [N,H,W,1], with the following extents:
[ni, BOFFSH+BSTRH*hi : BOFFSH+BSTRH*hi+BSZH, BOFFSW+BSTRW*wi : BOFFSW+BSTRW*wi+BSZW, :].
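The block selection described above can be sketched in plain Python. This is a reference for the semantics only, not the CUDA implementation: `reduce_mask_reference` is a hypothetical name, the mask is a nested list indexed as `mask[n][h][w]` with the trailing channel dimension dropped, the result is a plain list of triples rather than the op's output structure, and treating out-of-range mask positions as zero is an assumption.

```python
def reduce_mask_reference(mask, bcount, bsize, boffset, bstride,
                          tol=0.5, avgpool=False):
    """Return (ni, hi, wi) for every block whose pooled mask value exceeds tol."""
    bch, bcw = bcount
    bszh, bszw = bsize
    boffsh, boffsw = boffset
    bstrh, bstrw = bstride
    active = []
    for ni, image in enumerate(mask):
        height, width = len(image), len(image[0])
        for hi in range(bch):
            for wi in range(bcw):
                h0 = boffsh + bstrh * hi
                w0 = boffsw + bstrw * wi
                # Assumption: positions outside the mask count as zero.
                values = [
                    image[h][w] if 0 <= h < height and 0 <= w < width else 0.0
                    for h in range(h0, h0 + bszh)
                    for w in range(w0, w0 + bszw)
                ]
                pooled = sum(values) / len(values) if avgpool else max(values)
                if pooled > tol:
                    active.append((ni, hi, wi))
    return active


mask = [[[1, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]]
print(reduce_mask_reference(mask, [2, 2], [2, 2], [0, 0], [2, 2]))
# [(0, 0, 0), (0, 1, 1)]
print(reduce_mask_reference(mask, [2, 2], [2, 2], [0, 0], [2, 2], avgpool=True))
# [(0, 1, 1)]
```

With max pooling a single active pixel is enough to mark a block; with average pooling the top-left block averages 0.25 and falls below the 0.5 threshold.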
The resulting list of indices can then be passed to two other operations: sbnet_module.sparse_scatter and sbnet_module.sparse_gather.
The following snippets illustrate the use of these operations:
blockStack = sbnet_module.sparse_gather(
    x,
    indices.bin_counts,
    indices.active_block_indices,
    bsize=[BSZH, BSZW],  # block size
    boffset=[BOFFSH, BOFFSW],  # block offset
    bstride=[BSTRH, BSTRW],  # block stride
    transpose=do_transpose)
This operation uses the indices generated by reduce_mask to slice blocks of channel depth C out of the input tensor x of dimensions [N,H,W,C], as illustrated in the following pseudo-code snippet:
for bi, (ni, hi, wi) in enumerate(indices.active_block_indices):
    channel_slice = x[ni, BOFFSH+BSTRH*hi : BOFFSH+BSTRH*hi+BSZH, BOFFSW+BSTRW*wi : BOFFSW+BSTRW*wi+BSZW, :]
    # blocks are stacked along the first dimension, one entry per active block
    blockStack[bi, :, :, :] = channel_slice
If do_transpose is true, a fused transpose operation will also be performed and the resulting tensor will have dimensions [nBlocks, C, BSZH, BSZW].
Any out-of-range values will be padded with zeroes.
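The gather semantics, including the zero padding, can be sketched in plain Python on nested lists indexed as `x[n][h][w][c]`. `sparse_gather_reference` is a hypothetical name used for illustration; the sketch assumes blocks are stacked in the order of the index list and does not model the fused transpose or the `bin_counts` argument.

```python
def sparse_gather_reference(x, active_block_indices, bsize, boffset, bstride):
    """Return a list of blocks, each indexed as block[dh][dw][c]."""
    bszh, bszw = bsize
    boffsh, boffsw = boffset
    bstrh, bstrw = bstride
    height, width, channels = len(x[0]), len(x[0][0]), len(x[0][0][0])
    block_stack = []
    for ni, hi, wi in active_block_indices:
        h0 = boffsh + bstrh * hi
        w0 = boffsw + bstrw * wi
        block = [
            [
                # Out-of-range positions are padded with zeroes.
                list(x[ni][h][w]) if 0 <= h < height and 0 <= w < width
                else [0.0] * channels
                for w in range(w0, w0 + bszw)
            ]
            for h in range(h0, h0 + bszh)
        ]
        block_stack.append(block)
    return block_stack


# A 1x3x3x1 tensor whose values are 3*h + w.
x = [[[[float(3 * h + w)] for w in range(3)] for h in range(3)]]
blocks = sparse_gather_reference(x, [(0, 0, 0), (0, 1, 1)], [2, 2], [0, 0], [2, 2])
print(blocks[0])  # [[[0.0], [1.0]], [[3.0], [4.0]]]
print(blocks[1])  # [[[8.0], [0.0]], [[0.0], [0.0]]]
```

The second block starts at row 2, column 2 and extends past the 3x3 input, so three of its four positions are zero padded.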
The inverse operation is sbnet_module.sparse_scatter. The following snippet illustrates its use:
y = sbnet_module.sparse_scatter(
    blockStack,
    indices.bin_counts,
    indices.active_block_indices,
    x,  # base tensor to copy to output and overwrite on top of
    bsize=[BSZH, BSZW],
    boffset=[BOFFSH, BOFFSW],
    bstride=[BSTRH, BSTRW],
    add=do_add,
    atomic=False,  # use atomic or regular adds
    transpose=do_transpose)
Note that due to a limitation of the TensorFlow API, an intermediate tensor cannot be modified in place unless it is a tf.Variable.
This necessitates creating an intermediate tensor inside the op and performing a copy, which hurts performance.
We therefore created a second version of the op, sbnet_module.sparse_scatter_var, that expects x to be a tf.Variable and modifies it in place.
Using sparse_scatter_var is strongly recommended for maximum performance.
The effect of this operation is the opposite of sparse_gather: the input blocks will be written on top of base tensor x, or added to its contents if do_add is True.
The following pseudo-code snippet illustrates the semantics of sparse_scatter:
for bi, (ni, hi, wi) in enumerate(indices.active_block_indices):
    if do_add:
        x[ni, BOFFSH+BSTRH*hi : BOFFSH+BSTRH*hi+BSZH, BOFFSW+BSTRW*wi : BOFFSW+BSTRW*wi+BSZW, :]\
            += blockStack[bi, :, :, :]
    else:
        x[ni, BOFFSH+BSTRH*hi : BOFFSH+BSTRH*hi+BSZH, BOFFSW+BSTRW*wi : BOFFSW+BSTRW*wi+BSZW, :]\
            = blockStack[bi, :, :, :]
So the blocks are 'put back in place'; however, the sizes and strides can differ from those passed to sparse_gather. This enables implementation of sparse ResNet blocks where output resolution is reduced after a 'VALID' convolution. Similar to sparse_gather, if do_transpose is true, a fused transpose operation will also be performed by sparse_scatter, permuting the input [N,C,H,W] dimensions to [N,H,W,C] in the output.
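The scatter semantics can likewise be sketched in plain Python on nested lists indexed as `x[n][h][w][c]`. `sparse_scatter_reference` is a hypothetical name; the sketch copies the base tensor (like the non-in-place op), assumes block `bi` of the stack corresponds to entry `bi` of the index list, skips out-of-range positions as an assumption, and does not model the fused transpose or atomic adds.

```python
def sparse_scatter_reference(block_stack, active_block_indices, x,
                             bsize, boffset, bstride, add=False):
    """Write (or add) each block onto a copy of the base tensor x."""
    bszh, bszw = bsize
    boffsh, boffsw = boffset
    bstrh, bstrw = bstride
    # Deep copy so the caller's base tensor is left untouched.
    y = [[[list(px) for px in row] for row in image] for image in x]
    height, width = len(y[0]), len(y[0][0])
    for bi, (ni, hi, wi) in enumerate(active_block_indices):
        h0 = boffsh + bstrh * hi
        w0 = boffsw + bstrw * wi
        for dh in range(bszh):
            for dw in range(bszw):
                h, w = h0 + dh, w0 + dw
                if not (0 <= h < height and 0 <= w < width):
                    continue  # Assumption: out-of-range positions are skipped.
                src = block_stack[bi][dh][dw]
                if add:
                    y[ni][h][w] = [a + b for a, b in zip(y[ni][h][w], src)]
                else:
                    y[ni][h][w] = list(src)
    return y


base = [[[[1.0] for _ in range(4)] for _ in range(2)]]  # 1x2x4x1 tensor of ones
stack = [[[[5.0], [6.0]], [[7.0], [8.0]]]]               # one 2x2x1 block
print(sparse_scatter_reference(stack, [(0, 0, 1)], base, [2, 2], [0, 0], [2, 2]))
# [[[[1.0], [1.0], [5.0], [6.0]], [[1.0], [1.0], [7.0], [8.0]]]]
```

Passing `add=True` accumulates into the base values instead of overwriting them, which is where overlapping blocks start to matter.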
Typically the block size for a 'VALID' convolution is reduced by 2 in each spatial dimension for each 3x3 convolution, thus creating non-overlapping outputs.
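The size arithmetic above can be checked with a few lines of Python. Both helper names are hypothetical, and the example values (18x18 gather blocks with stride 16) are chosen purely for illustration.

```python
def valid_conv_block_size(bsize, ksize=3, num_convs=1):
    """Spatial block size after num_convs 'VALID' ksize x ksize convolutions."""
    shrink = num_convs * (ksize - 1)
    out = [bsize[0] - shrink, bsize[1] - shrink]
    if min(out) <= 0:
        raise ValueError("block is too small for the requested convolutions")
    return out


def scatter_blocks_overlap(out_bsize, bstride):
    """True if neighbouring scattered blocks would overlap in H or W."""
    return out_bsize[0] > bstride[0] or out_bsize[1] > bstride[1]


out_bsize = valid_conv_block_size([18, 18], ksize=3, num_convs=1)
print(out_bsize)                                    # [16, 16]
print(scatter_blocks_overlap(out_bsize, [16, 16]))  # False
```

Gathering with overlapping 18x18 blocks at stride 16 and scattering the 16x16 results at the same stride tiles the output without overlap, so a plain (non-additive) scatter is sufficient in this configuration.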
Note that even though we currently support atomic adds in scatter with add=True, the gradient is not implemented at this time if overlapping scatters are used in the forward pass.
Benchmarks for SBNet are located in the sbnet_tensorflow/benchmarks/ subdirectory.
To run benchmarks execute:
cd sbnet_tensorflow/benchmarks && ./run_all_behchmarks.bash
Note that we average over a number of runs and test many permutations of parameters so this may take about 20 minutes (on a Titan XP) and will produce a number of .csv files in your /home/user/ directory. We benchmark individual sparse convolutions and entire sparse ResNet blocks on a synthetic mask with variable sparsity.
To run unit tests execute:
cd sbnet_tensorflow/sbnet_ops && make test
For comparison, we implemented benchmarking code for Submanifold Sparse Convolutional Networks. Running this benchmark requires the Submanifold Sparse Convolutions Python package to be installed:
git clone https://github.com/facebookresearch/SparseConvNet.git
Follow the setup instructions in the SparseConvNet repo.
Code integration with Submanifold Sparse Convolutions was tested with git sha 609224df3c0e42b8a1dd4073aaa56fab805096c6. To reset the repo to this sha use the following sequence of commands:
cd SparseConvNet
git checkout 609224df3c0e42b8a1dd4073aaa56fab805096c6
The benchmark code is located in the sbnet_tensorflow/benchmark_submanifold directory.
The current code is not tuned for performance with non-square block sizes and has specialized implementations only for a specific list of block sizes. This list includes square blocks of sizes 1 to 34 and a few others. To achieve maximum performance for other sizes, you would need to add your own template instantiations by modifying the SIZE_TEMPLATES macro in sparse_gather.cu.
For now, we do not accept pull requests to this repo, as we are currently setting up automated CI. If you would like to contribute to this repository, feel free to create a GitHub issue.
If you use our code, please consider citing the following: M. Ren, A. Pokrovsky, B. Yang, and R. Urtasun. SBNet: Sparse Blocks Network for Fast Inference. CoRR, abs/1801.02108, 2018.
@article{ren18sbnet,
author = {Mengye Ren and
Andrei Pokrovsky and
Bin Yang and
Raquel Urtasun},
title = {SBNet: Sparse Blocks Network for Fast Inference},
journal = {CoRR},
volume = {abs/1801.02108},
year = {2018},
}