Easy benchmarking of all public open-source implementations of convnets. A summary is provided in the section below.
Machine: 6-core Intel Core i7-5930K CPU @ 3.50GHz + NVIDIA Titan X + Ubuntu 14.04 x86_64
##Imagenet Winners Benchmarking
I picked some popular ImageNet models and clocked the time for a full forward + backward pass. Times are averaged over 10 runs. Dropout and softmax layers are ignored.
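The timing procedure described above can be sketched roughly as follows. This is a minimal illustration, not the harness that produced these numbers: `time_passes`, `forward`, `backward`, and `sync` are hypothetical names, and the warm-up run is an assumption the text does not mention. GPU kernels launch asynchronously, so `sync` should block until the device is idle before each timestamp is read.

```python
import time
from statistics import mean


def time_passes(forward, backward, runs=10, warmup=1, sync=lambda: None):
    """Return (total_ms, forward_ms, backward_ms), each averaged over `runs`.

    `forward` and `backward` are caller-supplied callables (hypothetical);
    `sync` should wait for the device to finish, and is a no-op by default.
    """
    # Warm-up runs are an assumption: they keep one-time setup costs
    # (allocation, kernel selection) out of the measurement.
    for _ in range(warmup):
        forward()
        backward()
    sync()

    fwd_ms, bwd_ms = [], []
    for _ in range(runs):
        t0 = time.perf_counter()
        forward()
        sync()
        t1 = time.perf_counter()
        backward()
        sync()
        t2 = time.perf_counter()
        fwd_ms.append((t1 - t0) * 1000.0)
        bwd_ms.append((t2 - t1) * 1000.0)

    f, b = mean(fwd_ms), mean(bwd_ms)
    return f + b, f, b


if __name__ == "__main__":
    # Dummy CPU workloads standing in for a real model.
    total, f, b = time_passes(lambda: sum(range(10_000)),
                              lambda: sum(range(20_000)))
    print(f"total {total:.3f} ms = forward {f:.3f} ms + backward {b:.3f} ms")
```

Reporting forward and backward separately, as the tables do, makes it easy to see which pass dominates for a given library.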
AlexNet (One Weird Trick paper) - Input 128x3x224x224
| Library | Class | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|---|
| Nervana-fp16 | ConvLayer | 92 | 29 | 62 |
| CuDNN[R3]-fp16 | cudnn.SpatialConvolution | 96 | 30 | 66 |
| CuDNN[R3]-fp32 | cudnn.SpatialConvolution | 96 | 32 | 64 |
| Nervana-fp32 | ConvLayer | 101 | 32 | 69 |
| fbfft | fbnn.SpatialConvolution | 104 | 31 | 72 |
| cudaconvnet2* | ConvLayer | 177 | 42 | 135 |
| CuDNN[R2] * | cudnn.SpatialConvolution | 231 | 70 | 161 |
| Caffe (native) | ConvolutionLayer | 324 | 121 | 203 |
| Torch-7 (native) | SpatialConvolutionMM | 342 | 132 | 210 |
| CL-nn (Torch) | SpatialConvolutionMM | 963 | 388 | 574 |
| Caffe-CLGreenTea | ConvolutionLayer | 1442 | 210 | 1232 |
Overfeat [fast] - Input 128x3x231x231
| Library | Class | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|---|
| CuDNN[R3]-fp16 | cudnn.SpatialConvolution | 313 | 107 | 206 |
| CuDNN[R3]-fp32 | cudnn.SpatialConvolution | 326 | 113 | 213 |
| fbfft | SpatialConvolutionCuFFT | 342 | 114 | 227 |
| Nervana-fp16 | ConvLayer | 355 | 112 | 242 |
| Nervana-fp32 | ConvLayer | 398 | 124 | 273 |
| cudaconvnet2* | ConvLayer | 723 | 176 | 547 |
| CuDNN[R2] * | cudnn.SpatialConvolution | 810 | 234 | 576 |
| Caffe | ConvolutionLayer | 823 | 355 | 468 |
| Torch-7 (native) | SpatialConvolutionMM | 878 | 379 | 499 |
| CL-nn (Torch) | SpatialConvolutionMM | 963 | 388 | 574 |
| Caffe-CLGreenTea | ConvolutionLayer | 2857 | 616 | 2240 |
OxfordNet [Model-A] - Input 64x3x224x224
| Library | Class | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|---|
| Nervana-fp16 | ConvLayer | 529 | 167 | 362 |
| Nervana-fp32 | ConvLayer | 590 | 180 | 410 |
| CuDNN[R3]-fp16 | cudnn.SpatialConvolution | 615 | 179 | 436 |
| CuDNN[R3]-fp32 | cudnn.SpatialConvolution | 615 | 196 | 418 |
| fbfft | SpatialConvolutionCuFFT | 1092 | 355 | 737 |
| cudaconvnet2* | ConvLayer | 1229 | 408 | 821 |
| CuDNN[R2] * | cudnn.SpatialConvolution | 1099 | 342 | 757 |
| Caffe | ConvolutionLayer | 1068 | 323 | 745 |
| Torch-7 (native) | SpatialConvolutionMM | 1105 | 350 | 755 |
| CL-nn (Torch) | SpatialConvolutionMM | 3437 | 875 | 2562 |
| Caffe-CLGreenTea | ConvolutionLayer | 5620 | 988 | 4632 |
GoogleNet V1 - Input 128x3x224x224
| Library | Class | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|---|
| Nervana-fp16 | ConvLayer | 283 | 85 | 197 |
| Nervana-fp32 | ConvLayer | 322 | 90 | 232 |
| CuDNN[R3]-fp32 | cudnn.SpatialConvolution | 431 | 117 | 313 |
| CuDNN[R3]-fp16 | cudnn.SpatialConvolution | 501 | 109 | 392 |
| Caffe | ConvolutionLayer | 1935 | 786 | 1148 |
| CL-nn (Torch) | SpatialConvolutionMM | 7016 | 3027 | 3988 |
| Caffe-CLGreenTea | ConvolutionLayer | 9462 | 746 | 8716 |
###Spatial Convolution layer (3D input 3D output, densely connected)
| Original Library | Class/Function Benchmarked | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|---|
| fbfft | SpatialConvolutionCuFFT | 256 | 101 | 155 |
| cuda-convnet2 * | ConvLayer | 977 | 201 | 776 |
| cuda-convnet** | pylearn2.cuda_convnet | 1077 | 312 | 765 |
| CuDNN R2 * | cudnn.SpatialConvolution | 1019 | 269 | 750 |
| Theano | CorrMM | 1225 | 407 | 818 |
| Caffe | ConvolutionLayer | 1231 | 396 | 835 |
| Torch-7 | SpatialConvolutionMM | 1295 | 418 | 877 |
| DeepCL | ConvolutionLayer | 6280 | 2648 | 3632 |
| cherry-picking**** | best per layer | 235 | 79 | 155 |
The table below is NOT updated for the Titan X. These numbers were measured on a Titan Black and are kept only for informational and legacy purposes.
| Original Library | Class/Function Benchmarked | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|---|
| Theano (experimental)*** | conv2d_fft | 1178 | 304 | 874 |
| Torch-7 | nn.SpatialConvolutionBHWD | 1892 | 581 | 1311 |
| ccv | ccv_convnet_layer | 809+bw | 809 | |
| Theano (legacy) | conv2d | 70774 | 3833 | 66941 |
- * indicates that the library was tested with Torch bindings of the specific kernels.
- ** indicates that the library was tested with Pylearn2 bindings.
- *** This is an experimental module that uses FFT to calculate convolutions. According to @benanne, it uses a lot of memory.
- **** The last row shows results obtainable when choosing the best-performing library for each layer.
- L1 - Input: 128x128, Batch-size: 128, Feature maps: 3->96, Kernel Size: 11x11, Stride: 1x1
- L2 - Input: 64x64, Batch-size: 128, Feature maps: 64->128, Kernel Size: 9x9, Stride: 1x1
- L3 - Input: 32x32, Batch-size: 128, Feature maps: 128->128, Kernel Size: 9x9, Stride: 1x1
- L4 - Input: 16x16, Batch-size: 128, Feature maps: 128->128, Kernel Size: 7x7, Stride: 1x1
- L5 - Input: 13x13, Batch-size: 128, Feature maps: 384->384, Kernel Size: 3x3, Stride: 1x1
- The table is ranked according to the total time of the forward+backward calls for layers (L1 + L2 + L3 + L4 + L5)
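To get a feel for how much work each of the five layer configurations represents, the sketch below computes the output size and the number of multiply-accumulates in a direct forward convolution. It assumes no padding, which the text does not state, and `LayerConfig` and its methods are hypothetical names used only for illustration. FFT-based libraries such as fbfft do not perform this many operations, so treat the counts as a property of the layer rather than of any implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerConfig:
    name: str
    size: int      # square input height/width
    batch: int
    in_maps: int
    out_maps: int
    kernel: int    # square kernel height/width
    stride: int = 1

    def out_size(self) -> int:
        # Assumption: no padding ("valid" convolution).
        return (self.size - self.kernel) // self.stride + 1

    def forward_macs(self) -> int:
        # One multiply-accumulate per (output pixel, input map, kernel tap),
        # for every output map and every image in the batch.
        o = self.out_size()
        return (self.batch * self.out_maps * o * o
                * self.in_maps * self.kernel * self.kernel)


# The five configurations listed above.
LAYERS = [
    LayerConfig("L1", 128, 128, 3, 96, 11),
    LayerConfig("L2", 64, 128, 64, 128, 9),
    LayerConfig("L3", 32, 128, 128, 128, 9),
    LayerConfig("L4", 16, 128, 128, 128, 7),
    LayerConfig("L5", 13, 128, 384, 384, 3),
]

if __name__ == "__main__":
    for layer in LAYERS:
        o = layer.out_size()
        gmacs = layer.forward_macs() / 1e9
        print(f"{layer.name}: output {o}x{o}, {gmacs:.1f} GMACs forward")
```

Under the no-padding assumption, L2 is by far the heaviest layer in direct-convolution terms, which is in line with it being the slowest column for most of the non-FFT libraries.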
#####Breakdown
Forward pass: columns L1, L2, L3, L4, L5, Total are times in milliseconds.
| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
|---|---|---|---|---|---|---|---|
| fbfft | SpatialConvolutionCuFFT | 57 | 27 | 6 | 2 | 9 | 101 |
| cuda-convnet2 * | ConvLayer | 36 | 113 | 40 | 4 | 8 | 201 |
| cuda-convnet** | pylearn2.cuda_convnet | 38 | 183 | 68 | 7 | 16 | 312 |
| CuDNN R2 | cudnn.SpatialConvolution | 56 | 143 | 53 | 6 | 11 | 269 |
| Theano | CorrMM | 91 | 143 | 121 | 24 | 28 | 407 |
| Caffe | ConvolutionLayer<Dtype> | 93 | 136 | 116 | 24 | 27 | 396 |
| Torch-7 | nn.SpatialConvolutionMM | 94 | 149 | 123 | 24 | 28 | 418 |
| DeepCL | ConvolutionLayer | 738 | 1241 | 518 | 47 | 104 | 2648 |
| cherry-picking**** | best per layer | 36 | 27 | 6 | 2 | 8 | 79 |
Backward pass: columns L1, L2, L3, L4, L5, Total are times in milliseconds.
| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
|---|---|---|---|---|---|---|---|
| fbfft | SpatialConvolutionCuFFT | 76 | 45 | 12 | 4 | 18 | 155 |
| cuda-convnet2 * | ConvLayer | 103 | 467 | 162 | 15 | 29 | 776 |
| cuda-convnet** | pylearn2.cuda_convnet | 136 | 433 | 147 | 15 | 34 | 765 |
| CuDNN R2 | cudnn.SpatialConvolution | 139 | 401 | 159 | 19 | 32 | 750 |
| Theano | CorrMM | 179 | 405 | 174 | 29 | 31 | 818 |
| Caffe | ConvolutionLayer<Dtype> | 200 | 405 | 172 | 28 | 30 | 835 |
| Torch-7 | nn.SpatialConvolutionMM | 206 | 432 | 178 | 29 | 32 | 877 |
| DeepCL | ConvolutionLayer | 484 | 2144 | 747 | 59 | 198 | 3632 |
| cherry-picking**** | best per layer | 76 | 45 | 12 | 4 | 18 | 155 |
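The cherry-picking rows can be reproduced from the per-layer breakdowns by taking the minimum over libraries for each layer. The sketch below copies the numbers from the two breakdown tables; `cherry_pick` is a hypothetical helper, not part of any of the benchmarked libraries.

```python
# Per-layer times in milliseconds (L1..L5), copied from the breakdown tables.
FORWARD_MS = {
    "fbfft":         [57, 27, 6, 2, 9],
    "cuda-convnet2": [36, 113, 40, 4, 8],
    "cuda-convnet":  [38, 183, 68, 7, 16],
    "CuDNN R2":      [56, 143, 53, 6, 11],
    "Theano":        [91, 143, 121, 24, 28],
    "Caffe":         [93, 136, 116, 24, 27],
    "Torch-7":       [94, 149, 123, 24, 28],
    "DeepCL":        [738, 1241, 518, 47, 104],
}

BACKWARD_MS = {
    "fbfft":         [76, 45, 12, 4, 18],
    "cuda-convnet2": [103, 467, 162, 15, 29],
    "cuda-convnet":  [136, 433, 147, 15, 34],
    "CuDNN R2":      [139, 401, 159, 19, 32],
    "Theano":        [179, 405, 174, 29, 31],
    "Caffe":         [200, 405, 172, 28, 30],
    "Torch-7":       [206, 432, 178, 29, 32],
    "DeepCL":        [484, 2144, 747, 59, 198],
}


def cherry_pick(table):
    """Return a list of (library, time_ms), one entry per layer,
    choosing the fastest library for each layer."""
    n_layers = len(next(iter(table.values())))
    picks = []
    for i in range(n_layers):
        best = min(table, key=lambda lib: table[lib][i])
        picks.append((best, table[best][i]))
    return picks


if __name__ == "__main__":
    for label, table in (("forward", FORWARD_MS), ("backward", BACKWARD_MS)):
        picks = cherry_pick(table)
        total = sum(ms for _, ms in picks)
        print(label, picks, "total:", total)
```

On these numbers the forward picks alternate between cuda-convnet2 (L1, L5) and fbfft (L2 to L4) for a total of 79 ms, while the backward picks are all fbfft for a total of 155 ms, matching the cherry-picking rows.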