convnet-burden

Estimates of memory consumption and FLOP counts for various convolutional neural networks.

Image Classification Architectures

The numbers below are for single-element batches.

| model | input size | param mem | feat. mem | flops | src | performance |
|---|---|---|---|---|---|---|
| alexnet | 227 x 227 | 233 MB | 3 MB | 727 MFLOPs | MCN | 41.80 / 19.20 |
| caffenet | 224 x 224 | 233 MB | 3 MB | 724 MFLOPs | MCN | 42.60 / 19.70 |
| squeezenet1-0 | 224 x 224 | 5 MB | 30 MB | 837 MFLOPs | PT | 41.90 / 19.58 |
| squeezenet1-1 | 224 x 224 | 5 MB | 17 MB | 360 MFLOPs | PT | 41.81 / 19.38 |
| vgg-f | 224 x 224 | 232 MB | 4 MB | 727 MFLOPs | MCN | 41.40 / 19.10 |
| vgg-m | 224 x 224 | 393 MB | 12 MB | 2 GFLOPs | MCN | 36.90 / 15.50 |
| vgg-s | 224 x 224 | 393 MB | 12 MB | 3 GFLOPs | MCN | 37.00 / 15.80 |
| vgg-m-2048 | 224 x 224 | 353 MB | 12 MB | 2 GFLOPs | MCN | 37.10 / 15.80 |
| vgg-m-1024 | 224 x 224 | 333 MB | 12 MB | 2 GFLOPs | MCN | 37.80 / 16.10 |
| vgg-m-128 | 224 x 224 | 315 MB | 12 MB | 2 GFLOPs | MCN | 40.80 / 18.40 |
| vgg-vd-16-atrous | 224 x 224 | 82 MB | 58 MB | 16 GFLOPs | N/A | - / - |
| vgg-vd-16 | 224 x 224 | 528 MB | 58 MB | 16 GFLOPs | MCN | 28.50 / 9.90 |
| vgg-vd-19 | 224 x 224 | 548 MB | 63 MB | 20 GFLOPs | MCN | 28.70 / 9.90 |
| googlenet | 224 x 224 | 51 MB | 26 MB | 2 GFLOPs | MCN | 34.20 / 12.90 |
| resnet18 | 224 x 224 | 45 MB | 23 MB | 2 GFLOPs | PT | 30.24 / 10.92 |
| resnet34 | 224 x 224 | 83 MB | 35 MB | 4 GFLOPs | PT | 26.70 / 8.58 |
| resnet-50 | 224 x 224 | 98 MB | 103 MB | 4 GFLOPs | MCN | 24.60 / 7.70 |
| resnet-101 | 224 x 224 | 170 MB | 155 MB | 8 GFLOPs | MCN | 23.40 / 7.00 |
| resnet-152 | 224 x 224 | 230 MB | 219 MB | 11 GFLOPs | MCN | 23.00 / 6.70 |
| resnext-50-32x4d | 224 x 224 | 96 MB | 132 MB | 4 GFLOPs | L1 | 22.60 / 6.49 |
| resnext-101-32x4d | 224 x 224 | 169 MB | 197 MB | 8 GFLOPs | L1 | 21.55 / 5.93 |
| resnext-101-64x4d | 224 x 224 | 319 MB | 273 MB | 16 GFLOPs | PT | 20.81 / 5.66 |
| inception-v3 | 299 x 299 | 91 MB | 89 MB | 6 GFLOPs | PT | 22.55 / 6.44 |
| SE-ResNet-50 | 224 x 224 | 107 MB | 103 MB | 4 GFLOPs | SE | 22.37 / 6.36 |
| SE-ResNet-101 | 224 x 224 | 189 MB | 155 MB | 8 GFLOPs | SE | 21.75 / 5.72 |
| SE-ResNet-152 | 224 x 224 | 255 MB | 220 MB | 11 GFLOPs | SE | 21.34 / 5.54 |
| SE-ResNeXt-50-32x4d | 224 x 224 | 105 MB | 132 MB | 4 GFLOPs | SE | 20.97 / 5.54 |
| SE-ResNeXt-101-32x4d | 224 x 224 | 187 MB | 197 MB | 8 GFLOPs | SE | 19.81 / 4.96 |
| SENet | 224 x 224 | 440 MB | 347 MB | 21 GFLOPs | SE | 18.68 / 4.47 |
| SE-BN-Inception | 224 x 224 | 46 MB | 43 MB | 2 GFLOPs | SE | 23.62 / 7.04 |
| densenet121 | 224 x 224 | 31 MB | 126 MB | 3 GFLOPs | PT | 25.35 / 7.83 |
| densenet161 | 224 x 224 | 110 MB | 235 MB | 8 GFLOPs | PT | 22.35 / 6.20 |
| densenet169 | 224 x 224 | 55 MB | 152 MB | 3 GFLOPs | PT | 24.00 / 7.00 |
| densenet201 | 224 x 224 | 77 MB | 196 MB | 4 GFLOPs | PT | 22.80 / 6.43 |
| mcn-mobilenet | 224 x 224 | 16 MB | 38 MB | 579 MFLOPs | AU | 29.40 / - |

Click on the model name for a more detailed breakdown of feature extraction costs at different input image/batch sizes. The performance numbers are reported as top-1 error/top-5 error on the 2012 ILSVRC validation data. The src column indicates the source of the benchmark scores using the following abbreviations:

  • MCN - scores obtained from the matconvnet website.
  • PT - scores obtained from the PyTorch torchvision module.
  • L1 - evaluated locally (follow link to view benchmark code).
  • AU - numbers reported by the paper authors.

These numbers provide an estimate of performance, but note that there may be small differences between the evaluation scripts from different sources.
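
As a rough illustration of the top-1/top-5 error metric used in the performance column, the sketch below computes a top-k error percentage from per-class scores. It is written in Python for brevity and is not the evaluation code used by any of the listed sources; the function name `topk_error` and the toy scores and labels are made up for this example.

```python
def topk_error(scores, labels, k):
    """Percentage of samples whose true label is not among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the k highest-scoring classes for this sample.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in topk:
            misses += 1
    return 100.0 * misses / len(labels)


# Toy data (hypothetical): 4 samples, 3 classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
labels = [1, 1, 0, 0]

top1 = topk_error(scores, labels, 1)
top2 = topk_error(scores, labels, 2)
print(f"{top1:.2f} / {top2:.2f}")
```

Details such as image preprocessing and cropping vary between evaluation scripts, which is one reason scores from different sources may differ slightly.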

References:

  • alexnet - Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
  • squeezenet - Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size." arXiv preprint arXiv:1602.07360 (2016).
  • vgg-m - Chatfield, Ken, et al. "Return of the devil in the details: Delving deep into convolutional nets." arXiv preprint arXiv:1405.3531 (2014).
  • vgg-vd-16/vgg-vd-19 - Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
  • vgg-vd-16-reduced - Liu, Wei, Andrew Rabinovich, and Alexander C. Berg. "Parsenet: Looking wider to see better." arXiv preprint arXiv:1506.04579 (2015).
  • googlenet - Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
  • inception - Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
  • resnet - He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
  • resnext - Xie, Saining, et al. "Aggregated residual transformations for deep neural networks." arXiv preprint arXiv:1611.05431 (2016).
  • SENets - Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." arXiv preprint arXiv:1709.01507 (2017).
  • densenet - Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

Object Detection Architectures

| model | input size | param memory | feature memory | flops |
|---|---|---|---|---|
| rfcn-res50-pascal | 600 x 850 | 122 MB | 1 GB | 79 GFLOPs |
| rfcn-res101-pascal | 600 x 850 | 194 MB | 2 GB | 117 GFLOPs |
| ssd-pascal-vggvd-300 | 300 x 300 | 100 MB | 116 MB | 31 GFLOPs |
| ssd-pascal-vggvd-512 | 512 x 512 | 104 MB | 337 MB | 91 GFLOPs |
| ssd-pascal-mobilenet-ft | 300 x 300 | 22 MB | 37 MB | 1 GFLOPs |
| faster-rcnn-vggvd-pascal | 600 x 850 | 523 MB | 600 MB | 172 GFLOPs |

The input sizes used are "typical" for each of the architectures listed, but can be varied. Anchor/priorbox generation and roi/psroi-pooling are not included in flop estimates. The ssd-pascal-mobilenet-ft detector uses the MobileNet feature extractor (the model used here was imported from the architecture made available by chuanqi305).

References:

  • faster-rcnn - Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.
  • r-fcn - Li, Yi, Kaiming He, and Jian Sun. "R-fcn: Object detection via region-based fully convolutional networks." Advances in Neural Information Processing Systems. 2016.
  • ssd - Liu, Wei, et al. "Ssd: Single shot multibox detector." European conference on computer vision. Springer, Cham, 2016.
  • mobilenets - Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).

Semantic Segmentation Architectures

| model | input size | param memory | feature memory | flops |
|---|---|---|---|---|
| pascal-fcn32s | 384 x 384 | 519 MB | 423 MB | 125 GFLOPs |
| pascal-fcn16s | 384 x 384 | 514 MB | 424 MB | 125 GFLOPs |
| pascal-fcn8s | 384 x 384 | 513 MB | 426 MB | 125 GFLOPs |
| deeplab-vggvd-v2 | 513 x 513 | 144 MB | 755 MB | 202 GFLOPs |
| deeplab-res101-v2 | 513 x 513 | 505 MB | 4 GB | 346 GFLOPs |

In this case, the input sizes are those which are typically taken as input crops during training. The deeplab-res101-v2 model uses multi-scale input, with scales x1, x0.75, x0.5 (computed relative to the given input size).
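
To get a feel for what multi-scale input costs, the sketch below (Python, for illustration only) estimates the total number of input pixels processed across the three scales relative to a single full-size pass. It assumes that convolutional cost grows roughly in proportion to the number of input pixels and ignores padding and rounding of the resized dimensions; the function name `relative_pixel_count` is hypothetical.

```python
def relative_pixel_count(scales):
    """Total input pixels across all scales, relative to one pass at scale 1.0.

    Each side is multiplied by the scale, so the pixel count grows with scale squared.
    """
    return sum(s * s for s in scales)


# Scales quoted above for deeplab-res101-v2.
scales = [1.0, 0.75, 0.5]
relative = relative_pixel_count(scales)
print(f"about {relative:.2f}x the pixels of a single full-size pass")
```

Under these assumptions, the three scales together process a little under twice as many pixels as a single pass at the given input size.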

References:

  • pascal-fcn - Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
  • deeplab - Chen, Liang-Chieh, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs." IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).

Keypoint Detection Architectures

| model | input size | param memory | feature memory | flops |
|---|---|---|---|---|
| multipose-mpi | 368 x 368 | 196 MB | 245 MB | 134 GFLOPs |
| multipose-coco | 368 x 368 | 200 MB | 246 MB | 136 GFLOPs |

References:

  • multipose - Cao, Zhe, et al. "Realtime multi-person 2d pose estimation using part affinity fields." arXiv preprint arXiv:1611.08050 (2016).

Notes and Assumptions

The numbers for each architecture should be reasonably framework agnostic. It is assumed that all weights and activations are stored as floats (with 4 bytes per datum) and that all ReLUs are performed in-place. Feature memory therefore represents an estimate of the total memory consumption of the features computed via a forward pass of the network for a given input, assuming that memory is not re-used (the one exception being that ReLUs are performed in-place and so do not add to the feature memory total). In practice, many frameworks will clear features from memory when they are no longer required by the execution path and will therefore require less memory than is noted here. The feature memory statistic is simply a rough guide as to "how big" the activations of the network look.
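
The accounting described above can be sketched in a few lines. The following Python snippet is an illustration of the arithmetic only, not the tool included in this repo; the helper names are hypothetical, and whether biases are included in the tool's parameter count is an assumption here. The example layer (3 input channels, 64 filters of size 3 x 3, padding 1, on a 224 x 224 input) matches the shape of the first convolution in vgg-vd-16.

```python
BYTES_PER_FLOAT = 4  # single precision, as assumed in the notes above


def conv_output_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * pad - kernel) // stride + 1


def param_bytes(c_in, c_out, kernel, bias=True):
    """Memory for the weights (and optionally biases) of one conv layer."""
    n = c_out * c_in * kernel * kernel + (c_out if bias else 0)
    return n * BYTES_PER_FLOAT


def feature_bytes(channels, height, width, batch=1):
    """Memory for one output feature map; an in-place ReLU adds nothing."""
    return batch * channels * height * width * BYTES_PER_FLOAT


out = conv_output_size(224, kernel=3, stride=1, pad=1)
params = param_bytes(c_in=3, c_out=64, kernel=3)
features = feature_bytes(channels=64, height=out, width=out)
print(out, params, features)
```

Summing `feature_bytes` over every layer that produces a new output (skipping in-place ReLUs) gives a feature memory total of the kind reported in the tables. Feature memory grows linearly with the batch size, while parameter memory does not change.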

Fused multiply-adds are counted as single operations. The numbers should be considered to be rough approximations - modern hardware makes it very difficult to accurately count operations (and even if you could, pipelining etc. means that it is not necessarily a good estimate of inference time).
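
As a worked example of this counting convention, the Python sketch below counts one operation per multiply-add for a convolutional layer and a fully connected layer. It is an illustration rather than the repo's implementation: the function names are hypothetical, and ignoring bias additions and non-linearities is an assumption made to keep the arithmetic simple.

```python
def conv_flops(h_out, w_out, c_in, c_out, kernel, groups=1):
    """Multiply-adds for one conv layer, each counted as a single operation.

    Every output value needs kernel * kernel * (c_in / groups) multiply-adds.
    """
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel


def fc_flops(n_in, n_out):
    """Multiply-adds for a fully connected layer."""
    return n_in * n_out


# A 3 x 3 convolution from 3 to 64 channels producing a 224 x 224 output,
# and a 4096 -> 1000 fully connected layer.
conv = conv_flops(h_out=224, w_out=224, c_in=3, c_out=64, kernel=3)
fc = fc_flops(4096, 1000)
print(f"conv: {conv / 1e6:.1f} MFLOPs, fc: {fc / 1e6:.1f} MFLOPs")
```

Summing such per-layer counts over a network gives a total of the kind shown in the flops column. Counting each multiply-add as two operations instead would double these figures.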

The tool for computing the estimates is implemented as a module for the autonn wrapper of matconvnet and is included in this repo, so feel free to take a look for extra details. This module can be installed with the vl_contrib package manager (it has two dependencies which can be installed in a similar manner: autonn and mcnExtraLayers). Matconvnet versions of all of the models can be obtained from either here or here.

For further reading on the topic, the 2017 ICLR submission "An analysis of deep neural network models for practical applications" is interesting. If you find any issues, or would like to add additional models, please open an issue or PR.