convnet-benchmarks

Easy benchmarking of all public open-source implementations of convnets. A summary is provided in the section below.

Machine: 6-core Intel Core i7-5930K CPU @ 3.50GHz + NVIDIA Titan X + Ubuntu 14.04 x86_64

##Imagenet Winners Benchmarking

I pick some popular ImageNet models and clock the time for a full forward + backward pass. I average the times over 10 runs and ignore dropout and softmax layers.
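The timing procedure described above can be sketched roughly as follows. This is a minimal illustration, not the actual benchmark scripts: `time_pass` and `benchmark` are hypothetical names, the warm-up call and the separate timing of forward and backward are assumptions, and the stand-in workloads run on the CPU. On a GPU, a real harness would also need to synchronize the device before reading the clock.

```python
import time


def time_pass(fn, runs=10, warmup=1):
    """Return the mean wall-clock time of fn() in milliseconds over `runs` calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (an assumption, not from the source)
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs


def benchmark(forward, backward, runs=10):
    """Time forward and backward separately and report their sum as the total."""
    fwd_ms = time_pass(forward, runs)
    bwd_ms = time_pass(backward, runs)
    return {"total": fwd_ms + bwd_ms, "forward": fwd_ms, "backward": bwd_ms}


if __name__ == "__main__":
    # Hypothetical CPU stand-ins for a real model's forward and backward calls.
    fake_forward = lambda: sum(i * i for i in range(10_000))
    fake_backward = lambda: sum(i * i for i in range(20_000))
    print(benchmark(fake_forward, fake_backward))
```

Averaging over several runs smooths out one-off costs such as memory allocation, which is presumably why the numbers below are means over 10 runs rather than single measurements.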

Notation

Input is described as {batch_size}x{num_filters}x{filter_width}x{filter_height}, where batch_size is the number of images used in a minibatch, num_filters is the number of channels in an image, filter_width is the width of the image, and filter_height is the height of the image.
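As a small illustration of this notation, the sketch below splits an input descriptor into its four fields. `InputShape` and `parse_input` are hypothetical helpers introduced here for the example; they are not part of any benchmarked library.

```python
from typing import NamedTuple


class InputShape(NamedTuple):
    batch_size: int     # images per minibatch
    num_filters: int    # channels in each image
    filter_width: int   # image width
    filter_height: int  # image height


def parse_input(descriptor: str) -> InputShape:
    """Parse a string such as '128x3x224x224' into its four named fields."""
    parts = descriptor.lower().split("x")
    if len(parts) != 4:
        raise ValueError(f"expected 4 'x'-separated fields, got {descriptor!r}")
    return InputShape(*(int(p) for p in parts))


shape = parse_input("128x3x224x224")
print(shape)
# Number of input values in one minibatch.
print(shape.batch_size * shape.num_filters * shape.filter_width * shape.filter_height)
```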

######One small note:

The CuDNN benchmarks are done using Torch bindings. One can also do the same via Caffe bindings or the bindings of any other library. This note is here to clarify that Caffe (native) and Torch (native) are the convolution kernels that are present as a default fallback. Some frameworks, such as TensorFlow and Chainer, are benchmarked with CuDNN even though this is not explicitly mentioned, so one might think that these frameworks as a whole are faster than, for example, Caffe, which might not be the case.

AlexNet (One Weird Trick paper) - Input 128x3x224x224

| Library | Class | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- | --- |
| CuDNN[R4]-fp16 (Torch) | cudnn.SpatialConvolution | 71 | 25 | 46 |
| Nervana-neon-fp16 | ConvLayer | 78 | 25 | 52 |
| CuDNN[R4]-fp32 (Torch) | cudnn.SpatialConvolution | 81 | 27 | 53 |
| TensorFlow | conv2d | 81 | 26 | 55 |
| Nervana-neon-fp32 | ConvLayer | 87 | 28 | 58 |
| fbfft (Torch) | fbnn.SpatialConvolution | 104 | 31 | 72 |
| Chainer | Convolution2D | 177 | 40 | 136 |
| cudaconvnet2* | ConvLayer | 177 | 42 | 135 |
| CuDNN[R2] * | cudnn.SpatialConvolution | 231 | 70 | 161 |
| Caffe (native) | ConvolutionLayer | 324 | 121 | 203 |
| Torch-7 (native) | SpatialConvolutionMM | 342 | 132 | 210 |
| CL-nn (Torch) | SpatialConvolutionMM | 963 | 388 | 574 |
| Caffe-CLGreenTea | ConvolutionLayer | 1442 | 210 | 1232 |

Overfeat [fast] - Input 128x3x231x231

| Library | Class | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- | --- |
| Nervana-neon-fp16 | ConvLayer | 176 | 58 | 118 |
| Nervana-neon-fp32 | ConvLayer | 211 | 69 | 141 |
| CuDNN[R4]-fp16 (Torch) | cudnn.SpatialConvolution | 242 | 86 | 156 |
| CuDNN[R4]-fp32 (Torch) | cudnn.SpatialConvolution | 268 | 94 | 174 |
| TensorFlow | conv2d | 279 | 90 | 189 |
| fbfft (Torch) | SpatialConvolutionCuFFT | 342 | 114 | 227 |
| Chainer | Convolution2D | 620 | 135 | 484 |
| cudaconvnet2* | ConvLayer | 723 | 176 | 547 |
| CuDNN[R2] * | cudnn.SpatialConvolution | 810 | 234 | 576 |
| Caffe | ConvolutionLayer | 823 | 355 | 468 |
| Torch-7 (native) | SpatialConvolutionMM | 878 | 379 | 499 |
| CL-nn (Torch) | SpatialConvolutionMM | 963 | 388 | 574 |
| Caffe-CLGreenTea | ConvolutionLayer | 2857 | 616 | 2240 |

OxfordNet [Model-A] - Input 64x3x224x224

| Library | Class | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- | --- |
| Nervana-neon-fp16 | ConvLayer | 254 | 82 | 171 |
| Nervana-neon-fp32 | ConvLayer | 320 | 103 | 217 |
| CuDNN[R4]-fp16 (Torch) | cudnn.SpatialConvolution | 471 | 140 | 331 |
| CuDNN[R4]-fp32 (Torch) | cudnn.SpatialConvolution | 529 | 162 | 366 |
| TensorFlow | conv2d | 540 | 158 | 382 |
| Chainer | Convolution2D | 885 | 251 | 632 |
| fbfft (Torch) | SpatialConvolutionCuFFT | 1092 | 355 | 737 |
| cudaconvnet2* | ConvLayer | 1229 | 408 | 821 |
| CuDNN[R2] * | cudnn.SpatialConvolution | 1099 | 342 | 757 |
| Caffe | ConvolutionLayer | 1068 | 323 | 745 |
| Torch-7 (native) | SpatialConvolutionMM | 1105 | 350 | 755 |
| CL-nn (Torch) | SpatialConvolutionMM | 3437 | 875 | 2562 |
| Caffe-CLGreenTea | ConvolutionLayer | 5620 | 988 | 4632 |

GoogleNet V1 - Input 128x3x224x224

| Library | Class | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- | --- |
| Nervana-neon-fp16 | ConvLayer | 230 | 72 | 157 |
| Nervana-neon-fp32 | ConvLayer | 270 | 84 | 186 |
| TensorFlow | conv2d | 445 | 135 | 310 |
| CuDNN[R4]-fp16 (Torch) | cudnn.SpatialConvolution | 462 | 112 | 349 |
| CuDNN[R4]-fp32 (Torch) | cudnn.SpatialConvolution | 470 | 130 | 340 |
| Chainer | Convolution2D | 687 | 189 | 497 |
| Caffe | ConvolutionLayer | 1935 | 786 | 1148 |
| CL-nn (Torch) | SpatialConvolutionMM | 7016 | 3027 | 3988 |
| Caffe-CLGreenTea | ConvolutionLayer | 9462 | 746 | 8716 |

Layer-wise Benchmarking (Last Updated April 2015)

###Spatial Convolution layer (3D input 3D output, densely connected)

forward + backprop (wrt input and weights)

| Original Library | Class/Function Benchmarked | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- | --- |
| fbfft | SpatialConvolutionCuFFT | 256 | 101 | 155 |
| cuda-convnet2 * | ConvLayer | 977 | 201 | 776 |
| cuda-convnet** | pylearn2.cuda_convnet | 1077 | 312 | 765 |
| CuDNN R2 * | cudnn.SpatialConvolution | 1019 | 269 | 750 |
| Theano | CorrMM | 1225 | 407 | 818 |
| Caffe | ConvolutionLayer | 1231 | 396 | 835 |
| Torch-7 | SpatialConvolutionMM | 1265 | 418 | 877 |
| DeepCL | ConvolutionLayer | 6280 | 2648 | 3632 |
| cherry-picking**** | best per layer | 235 | 79 | 155 |

The table below is NOT UPDATED for the Titan X. Its numbers were measured on a Titan Black and are kept here only for informational and legacy purposes.

| Original Library | Class/Function Benchmarked | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- | --- |
| Theano (experimental)*** | conv2d_fft | 1178 | 304 | 874 |
| Torch-7 | nn.SpatialConvolutionBHWD | 1892 | 581 | 1311 |
| ccv | ccv_convnet_layer | 809+bw | 809 | |
| Theano (legacy) | conv2d | 70774 | 3833 | 66941 |

* indicates that the library was tested with Torch bindings of the specific kernels.
** indicates that the library was tested with Pylearn2 bindings.
*** This is an experimental module that uses FFT to calculate convolutions. It uses a lot of memory, according to @benanne.
**** The last row shows results obtainable when choosing the best-performing library for each layer.
L1 - Input: 128x128 Batch-size 128, Feature maps: 3->96, Kernel Size: 11x11, Stride: 1x1
L2 - Input: 64x64 Batch-size 128, Feature maps: 64->128, Kernel Size: 9x9, Stride: 1x1
L3 - Input: 32x32 Batch-size 128, Feature maps: 128->128, Kernel Size: 9x9, Stride: 1x1
L4 - Input: 16x16 Batch-size 128, Feature maps: 128->128, Kernel Size: 7x7, Stride: 1x1
L5 - Input: 13x13 Batch-size 128, Feature maps: 384->384, Kernel Size: 3x3, Stride: 1x1
The table is ranked according to the total time of the forward + backward calls summed over all layers (L1 + L2 + L3 + L4 + L5).
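To get a feel for the relative size of the five layer configurations, the sketch below estimates each layer's output size and the multiply-accumulate count of a direct forward convolution. It assumes zero padding, which the layer descriptions do not state, and `output_size` and `forward_macs` are hypothetical helpers. FFT-based implementations such as fbfft do not perform a direct convolution, so these counts are only a rough guide to the work involved.

```python
# (name, input size, input maps, output maps, kernel size, stride),
# copied from the layer descriptions L1..L5.
LAYERS = [
    ("L1", 128, 3, 96, 11, 1),
    ("L2", 64, 64, 128, 9, 1),
    ("L3", 32, 128, 128, 9, 1),
    ("L4", 16, 128, 128, 7, 1),
    ("L5", 13, 384, 384, 3, 1),
]
BATCH = 128


def output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution; padding=0 is an assumption."""
    return (size + 2 * padding - kernel) // stride + 1


def forward_macs(size, in_maps, out_maps, kernel, stride, batch=BATCH):
    """Multiply-accumulates of one direct forward convolution over a minibatch."""
    out = output_size(size, kernel, stride)
    return batch * out_maps * out * out * in_maps * kernel * kernel


for name, size, in_maps, out_maps, kernel, stride in LAYERS:
    out = output_size(size, kernel, stride)
    gmacs = forward_macs(size, in_maps, out_maps, kernel, stride) / 1e9
    print(f"{name}: output {out}x{out}, {gmacs:.1f} GMACs forward")
```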

#####Breakdown

forward

Columns L1, L2, L3, L4, L5, Total are times in milliseconds

| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| fbfft | SpatialConvolutionCuFFT | 57 | 27 | 6 | 2 | 9 | 101 |
| cuda-convnet2 * | ConvLayer | 36 | 113 | 40 | 4 | 8 | 201 |
| cuda-convnet** | pylearn2.cuda_convnet | 38 | 183 | 68 | 7 | 16 | 312 |
| CuDNN R2 | cudnn.SpatialConvolution | 56 | 143 | 53 | 6 | 11 | 269 |
| Theano | CorrMM | 91 | 143 | 121 | 24 | 28 | 407 |
| Caffe | `ConvolutionLayer<Dtype>` | 93 | 136 | 116 | 24 | 27 | 396 |
| Torch-7 | nn.SpatialConvolutionMM | 94 | 149 | 123 | 24 | 28 | 418 |
| DeepCL | ConvolutionLayer | 738 | 1241 | 518 | 47 | 104 | 2648 |
| cherry-picking**** | best per layer | 36 | 27 | 6 | 2 | 8 | 79 |
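The cherry-picking row can be recomputed from the other rows by taking the per-layer minimum, as in this short sketch. The numbers are copied from the forward table above; `cherry_pick` is a hypothetical helper written for this example.

```python
# Forward times in ms for layers L1..L5, copied from the table above.
FORWARD_MS = {
    "fbfft": [57, 27, 6, 2, 9],
    "cuda-convnet2": [36, 113, 40, 4, 8],
    "cuda-convnet": [38, 183, 68, 7, 16],
    "CuDNN R2": [56, 143, 53, 6, 11],
    "Theano": [91, 143, 121, 24, 28],
    "Caffe": [93, 136, 116, 24, 27],
    "Torch-7": [94, 149, 123, 24, 28],
    "DeepCL": [738, 1241, 518, 47, 104],
}


def cherry_pick(times):
    """For each layer, return (fastest library, its time) across all libraries."""
    n_layers = len(next(iter(times.values())))
    picks = []
    for layer in range(n_layers):
        lib = min(times, key=lambda name: times[name][layer])
        picks.append((lib, times[lib][layer]))
    return picks


picks = cherry_pick(FORWARD_MS)
for i, (lib, ms) in enumerate(picks, start=1):
    print(f"L{i}: {lib} ({ms} ms)")
print("Total:", sum(ms for _, ms in picks), "ms")
```

The per-layer minima come out as 36, 27, 6, 2 and 8 ms, for a total of 79 ms, matching the last row of the table. The winners alternate between cuda-convnet2 (L1, L5) and fbfft (L2, L3, L4), which shows that no single library is fastest on every layer.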

backward (gradInput + gradWeight)