convnet-benchmarks

Easy benchmarking of all public open-source implementations of convnets. A summary is provided in the section below.

Machine: 6-core Intel Core i7-5930K CPU @ 3.50GHz + NVIDIA Titan X + Ubuntu 14.04 x86_64

##Imagenet Winners Benchmarking

I pick some popular ImageNet models and clock the time for a full forward + backward pass. I average my times over 10 runs, and I ignore dropout and softmax layers.
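
The timing procedure described above (a full forward + backward pass, averaged over 10 runs) can be sketched roughly as follows. This is a minimal illustration, not the harness used to produce the tables: the `benchmark` function, its `warmup` pass, and the `sync` hook are hypothetical names introduced here. The `sync` callable stands in for whatever call a given framework provides to wait for asynchronous GPU work; by default it does nothing.

```python
import time


def benchmark(forward, backward, runs=10, warmup=1, sync=lambda: None):
    """Return (total_ms, forward_ms, backward_ms), each averaged over `runs`.

    `forward` and `backward` are zero-argument callables (hypothetical
    stand-ins for a model's forward and backward pass). `sync` should block
    until any asynchronous device work has finished.
    """
    for _ in range(warmup):  # untimed warm-up pass (an assumption, not from the source)
        forward()
        backward()
    sync()

    fwd = bwd = 0.0
    for _ in range(runs):
        t0 = time.perf_counter()
        forward()
        sync()  # make sure the forward pass has really finished
        t1 = time.perf_counter()
        backward()
        sync()  # make sure the backward pass has really finished
        t2 = time.perf_counter()
        fwd += t1 - t0
        bwd += t2 - t1

    fwd_ms = 1000.0 * fwd / runs
    bwd_ms = 1000.0 * bwd / runs
    return fwd_ms + bwd_ms, fwd_ms, bwd_ms


if __name__ == "__main__":
    # Dummy CPU workloads, only to show the call shape.
    total, f, b = benchmark(lambda: sum(range(10000)), lambda: sum(range(20000)))
    print("total %.3f ms (forward %.3f, backward %.3f)" % (total, f, b))
```

Timing forward and backward separately and summing them keeps the three columns consistent with each other; small differences between the total and the sum of the two parts in the tables are presumably due to rounding.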

AlexNet (One Weird Trick paper) - Input 128x3x224x224

Library	Class	Time (ms)	forward (ms)	backward (ms)
Nervana-fp16	ConvLayer	92	29	62
CuDNN[R3]-fp16	cudnn.SpatialConvolution	96	30	66
CuDNN[R3]-fp32	cudnn.SpatialConvolution	96	32	64
Nervana-fp32	ConvLayer	101	32	69
fbfft	fbnn.SpatialConvolution	104	31	72
cudaconvnet2*	ConvLayer	177	42	135
CuDNN[R2] *	cudnn.SpatialConvolution	231	70	161
Caffe (native)	ConvolutionLayer	324	121	203
Torch-7 (native)	SpatialConvolutionMM	342	132	210
CL-nn (Torch)	SpatialConvolutionMM	963	388	574
Caffe-CLGreenTea	ConvolutionLayer	1442	210	1232

Overfeat [fast] - Input 128x3x231x231

Library	Class	Time (ms)	forward (ms)	backward (ms)
CuDNN[R3]-fp16	cudnn.SpatialConvolution	313	107	206
CuDNN[R3]-fp32	cudnn.SpatialConvolution	326	113	213
fbfft	SpatialConvolutionCuFFT	342	114	227
Nervana-fp16	ConvLayer	355	112	242
Nervana-fp32	ConvLayer	398	124	273
cudaconvnet2*	ConvLayer	723	176	547
CuDNN[R2] *	cudnn.SpatialConvolution	810	234	576
Caffe	ConvolutionLayer	823	355	468
Torch-7 (native)	SpatialConvolutionMM	878	379	499
CL-nn (Torch)	SpatialConvolutionMM	963	388	574
Caffe-CLGreenTea	ConvolutionLayer	2857	616	2240

OxfordNet [Model-A] - Input 64x3x224x224

Library	Class	Time (ms)	forward (ms)	backward (ms)
Nervana-fp16	ConvLayer	529	167	362
Nervana-fp32	ConvLayer	590	180	410
CuDNN[R3]-fp16	cudnn.SpatialConvolution	615	179	436
CuDNN[R3]-fp32	cudnn.SpatialConvolution	615	196	418
fbfft	SpatialConvolutionCuFFT	1092	355	737
cudaconvnet2*	ConvLayer	1229	408	821
CuDNN[R2] *	cudnn.SpatialConvolution	1099	342	757
Caffe	ConvolutionLayer	1068	323	745
Torch-7 (native)	SpatialConvolutionMM	1105	350	755
CL-nn (Torch)	SpatialConvolutionMM	3437	875	2562
Caffe-CLGreenTea	ConvolutionLayer	5620	988	4632

GoogleNet V1 - Input 128x3x224x224

Library	Class	Time (ms)	forward (ms)	backward (ms)
Nervana-fp16	ConvLayer	283	85	197
Nervana-fp32	ConvLayer	322	90	232
CuDNN[R3]-fp32	cudnn.SpatialConvolution	431	117	313
CuDNN[R3]-fp16	cudnn.SpatialConvolution	501	109	392
Caffe	ConvolutionLayer	1935	786	1148
CL-nn (Torch)	SpatialConvolutionMM	7016	3027	3988
Caffe-CLGreenTea	ConvolutionLayer	9462	746	8716

##Layer-wise Benchmarking (Last Updated April 2015)

###Spatial Convolution layer (3D input 3D output, densely connected)

forward + backprop (wrt input and weights)

Original Library	Class/Function Benchmarked	Time (ms)	forward (ms)	backward (ms)
fbfft	SpatialConvolutionCuFFT	256	101	155
cuda-convnet2 *	ConvLayer	977	201	776
cuda-convnet**	pylearn2.cuda_convnet	1077	312	765
CuDNN R2 *	cudnn.SpatialConvolution	1019	269	750
Theano	CorrMM	1225	407	818
Caffe	ConvolutionLayer	1231	396	835
Torch-7	SpatialConvolutionMM	1265	418	877
DeepCL	ConvolutionLayer	6280	2648	3632
cherry-picking****	best per layer	235	79	155

The table below has NOT been updated for Titan X. Its numbers were measured on a Titan Black and are kept only for informational and legacy purposes.

Original Library	Class/Function Benchmarked	Time (ms)	forward (ms)	backward (ms)
Theano (experimental)***	conv2d_fft	1178	304	874
Torch-7	nn.SpatialConvolutionBHWD	1892	581	1311
ccv	ccv_convnet_layer	809+bw	809
Theano (legacy)	conv2d	70774	3833	66941

* indicates that the library was tested with Torch bindings of the specific kernels.
** indicates that the library was tested with Pylearn2 bindings.
*** This is an experimental module that uses FFT to compute convolutions. According to @benanne, it uses a lot of memory.
**** The last row shows results obtainable when choosing the best-performing library for each layer.
L1 - Input: 128x128 Batch-size 128, Feature maps: 3->96, Kernel Size: 11x11, Stride: 1x1
L2 - Input: 64x64 Batch-size 128, Feature maps: 64->128, Kernel Size: 9x9, Stride: 1x1
L3 - Input: 32x32 Batch-size 128, Feature maps: 128->128, Kernel Size: 9x9, Stride: 1x1
L4 - Input: 16x16 Batch-size 128, Feature maps: 128->128, Kernel Size: 7x7, Stride: 1x1
L5 - Input: 13x13 Batch-size 128, Feature maps: 384->384, Kernel Size: 3x3, Stride: 1x1
The table is ranked by total time (forward + backward calls), summed over all five layers (L1 + L2 + L3 + L4 + L5).
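
To get a feel for how much work each of the five layer configurations represents, the sketch below computes the output size and the number of multiply-accumulate operations in the forward pass of a direct convolution. It assumes square inputs and kernels and no padding; the list above does not state the padding, so that is an assumption of this example. The helper names are illustrative only.

```python
# (name, input size, input feature maps, output feature maps, kernel size),
# copied from the L1..L5 list above. Stride is 1x1 for every layer.
LAYERS = [
    ("L1", 128, 3, 96, 11),
    ("L2", 64, 64, 128, 9),
    ("L3", 32, 128, 128, 9),
    ("L4", 16, 128, 128, 7),
    ("L5", 13, 384, 384, 3),
]
BATCH = 128


def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (padding=0 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


def forward_macs(size, c_in, c_out, kernel, batch=BATCH):
    """Multiply-accumulates for one forward pass of a direct convolution."""
    out = conv_output_size(size, kernel)
    return batch * c_out * out * out * c_in * kernel * kernel


if __name__ == "__main__":
    for name, size, c_in, c_out, kernel in LAYERS:
        out = conv_output_size(size, kernel)
        gmacs = forward_macs(size, c_in, c_out, kernel) / 1e9
        print("%s: output %dx%d, %.1f GMACs forward" % (name, out, out, gmacs))
```

Under the no-padding assumption the output sizes are 118, 56, 24, 10 and 11 for L1 to L5. FFT-based implementations do not perform this many operations, so the count is only a yardstick for the direct and matrix-multiply based libraries.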

#####Breakdown

forward

Columns L1, L2, L3, L4, L5, Total are times in milliseconds

Original Library	Class/Function Benchmarked	L1	L2	L3	L4	L5	Total
fbfft	SpatialConvolutionCuFFT	57	27	6	2	9	101
cuda-convnet2 *	ConvLayer	36	113	40	4	8	201
cuda-convnet**	pylearn2.cuda_convnet	38	183	68	7	16	312
CuDNN R2	cudnn.SpatialConvolution	56	143	53	6	11	269
Theano	CorrMM	91	143	121	24	28	407
Caffe	ConvolutionLayer<Dtype>	93	136	116	24	27	396
Torch-7	nn.SpatialConvolutionMM	94	149	123	24	28	418
DeepCL	ConvolutionLayer	738	1241	518	47	104	2648
cherry-picking****	best per layer	36	27	6	2	8	79
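
The cherry-picking row can be reproduced from the forward table above by taking, for each layer, the fastest library. The sketch below does exactly that with the numbers copied from the table; the dictionary layout and the `cherry_pick` helper are illustrative names, not part of any benchmarked library.

```python
# Forward times in ms for L1..L5, copied from the table above.
FORWARD_MS = {
    "fbfft":         [57, 27, 6, 2, 9],
    "cuda-convnet2": [36, 113, 40, 4, 8],
    "cuda-convnet":  [38, 183, 68, 7, 16],
    "CuDNN R2":      [56, 143, 53, 6, 11],
    "Theano":        [91, 143, 121, 24, 28],
    "Caffe":         [93, 136, 116, 24, 27],
    "Torch-7":       [94, 149, 123, 24, 28],
    "DeepCL":        [738, 1241, 518, 47, 104],
}


def cherry_pick(times):
    """Return a list of (library, time_ms), one entry per layer, fastest first."""
    n_layers = len(next(iter(times.values())))
    best = []
    for i in range(n_layers):
        lib = min(times, key=lambda name: times[name][i])
        best.append((lib, times[lib][i]))
    return best


if __name__ == "__main__":
    picks = cherry_pick(FORWARD_MS)
    for layer, (lib, ms) in enumerate(picks, start=1):
        print("L%d: %s (%d ms)" % (layer, lib, ms))
    print("Total: %d ms" % sum(ms for _, ms in picks))  # 79, matching the table
```

With these numbers, cuda-convnet2 is picked for L1 and L5 and fbfft for L2 to L4, which gives the 36, 27, 6, 2, 8 entries and the total of 79 ms shown in the last row.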

backward (gradInput + gradWeight)