Memonger_Densenet_Resnet_MobileNet_LSTM

This project contains a series of python script to give sublinear memory plans of deep neural networks including Densenet, Resnet, Mobilenet-v2 and LSTM. This allows you to trade computation for memory and get sublinear memory cost, so you can train bigger/deeper nets with limited resources.

Mxnet

DenseNet/ResNet/MobileNet-v2

Thanks to tqchen, we improved the utilization of memory space when training recognition models such as ResNet in MXNet. In addition, we have further improved the memory space utilization of DenseNet and MobileNet-v2 by allocating memory more carefully.

Experiments:

opt   Network        GPUs   Speed           Memory for feature maps (old)   Memory for feature maps (new)
0     ResNet-50      4      517 samples/s   30537 MB                        14466 MB
1     ResNet-50      4      421 samples/s   15896 MB                        7530 MB
2     DenseNet-201   4      341 samples/s   125202 MB                       35470 MB
3     DenseNet-201   4      363 samples/s   26084 MB                        7389 MB

Note: Due to time constraints, the documentation for MobileNet-v2 will be added in the near future.

ResNet/LSTM

Thanks to tqchen, ResNet and LSTM are implemented in MXNet with efficient memory utilization.

This project contains a roughly 150-line Python script that produces sublinear memory plans for deep neural networks. This allows you to trade computation for memory and get sublinear memory cost, so you can train bigger/deeper nets with limited resources.

Reference Paper

Training Deep Nets with Sublinear Memory Cost Arxiv 1604.06174

How to Use

This code is based on MXNet, a lightweight, flexible and efficient framework for deep learning.

  • Configure your network as you normally would using the symbolic API.
  • Give hints to the allocator about the possible places where we need to bookkeep computations.
    • Set the attribute mirror_stage='True'; see example_resnet.py (and the annotated sketch after the basic example below).
    • The memonger will try to find possible dividing points on the nodes annotated as mirror_stage.
  • Call memonger.search_plan to get a symbolic graph with a memory plan.
import mxnet as mx
import memonger

# configure your network
net = my_symbol()

# call memory optimizer to search possible memory plan.
net_planned = memonger.search_plan(net)

# use as normal
model = mx.FeedForward(net_planned, ...)
model.fit(...)
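
For concreteness, a minimal sketch of annotating candidate nodes (the toy build_net below is illustrative, not example_resnet.py; search_plan also accepts the input shapes as keyword arguments):

import mxnet as mx
import memonger

def build_net(num_stages=8, num_filter=64):
    # toy network builder used only for illustration
    body = mx.sym.Variable('data')
    for i in range(num_stages):
        body = mx.sym.Convolution(body, num_filter=num_filter, kernel=(3, 3),
                                  pad=(1, 1), name='conv%d' % i)
        body = mx.sym.Activation(body, act_type='relu', name='relu%d' % i)
        # hint: this node is a candidate dividing point for the memory plan
        body._set_attr(mirror_stage='True')
    fc = mx.sym.FullyConnected(mx.sym.Flatten(body), num_hidden=10, name='fc')
    return mx.sym.SoftmaxOutput(fc, name='softmax')

net = build_net()
# pass the input shape(s) so the planner can estimate memory usage
net_planned = memonger.search_plan(net, data=(32, 3, 224, 224))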

Write your Own Memory Optimizer

MXNet's symbolic graph supports an attribute (the mirror attribute) that gives a hint on whether a result can be recomputed or not. You can choose to re-compute a result instead of storing it, for lower memory consumption. To make the output of a symbol re-computable, use

sym._set_attr(force_mirroring='True')

mxnet-memonger uses the same mechanism to do its memory planning. You can write your own memory optimizer simply by setting the force_mirroring attribute in a smart way.
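
For example, a minimal sketch (the conv_bn_relu builder is hypothetical, not part of mxnet-memonger) that marks cheap-to-recompute outputs while the network is being built:

import mxnet as mx

def conv_bn_relu(data, num_filter, name):
    conv = mx.sym.Convolution(data, num_filter=num_filter, kernel=(3, 3),
                              pad=(1, 1), name=name + '_conv')
    bn = mx.sym.BatchNorm(conv, name=name + '_bn')
    act = mx.sym.Activation(bn, act_type='relu', name=name + '_relu')
    # BatchNorm and ReLU outputs are cheap to recompute, so mirror them
    # instead of keeping them around for the backward pass
    bn._set_attr(force_mirroring='True')
    act._set_attr(force_mirroring='True')
    return act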

Pytorch

Thanks to gpleiss, DenseNet is implemented in PyTorch with efficient memory utilization.

A PyTorch >=1.0 implementation of DenseNets, optimized to save GPU memory.

Recent updates

  1. Now works on PyTorch 1.0! It uses the checkpointing feature, which makes this code WAY more efficient!!!

Motivation

While DenseNets are fairly easy to implement in deep learning frameworks, most implementations (such as the original) tend to be memory-hungry. In particular, the number of intermediate feature maps generated by batch normalization and concatenation operations grows quadratically with network depth. It is worth emphasizing that this is not a property inherent to DenseNets, but rather of the implementation.

This implementation uses a new strategy to reduce the memory consumption of DenseNets. We use checkpointing to compute the Batch Norm and concatenation feature maps. These intermediate feature maps are discarded during the forward pass and recomputed for the backward pass. This adds 15-20% of time overhead for training, but reduces feature map consumption from quadratic to linear.
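
As a rough PyTorch sketch of the idea (illustrative names, not the repository's exact code), the concatenate-BN-ReLU-1x1-conv bottleneck is wrapped in a function that torch.utils.checkpoint recomputes during the backward pass:

import torch
import torch.nn as nn
import torch.utils.checkpoint as cp

class EfficientBottleneck(nn.Module):
    """Concat + BN + ReLU + 1x1 conv whose intermediates are recomputed
    in the backward pass instead of being stored."""

    def __init__(self, num_input_features, growth_rate, bn_size=4):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_input_features)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(num_input_features, bn_size * growth_rate,
                              kernel_size=1, bias=False)

    def forward(self, *prev_features):
        def bottleneck(*features):
            # none of these intermediate tensors are kept for backward
            # when the call is checkpointed
            return self.conv(self.relu(self.norm(torch.cat(features, 1))))

        if any(f.requires_grad for f in prev_features):
            return cp.checkpoint(bottleneck, *prev_features)
        return bottleneck(*prev_features)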

This implementation is inspired by this technical report, which outlines a strategy for efficient DenseNets via memory sharing.

Requirements

  • PyTorch >=1.0.0
  • CUDA

Usage

In your existing project: There is one file in the models folder.

If you care about speed, and memory is not a concern, pass the efficient=False argument to the DenseNet constructor. Otherwise, pass in efficient=True (a short usage sketch follows the options below).

Options:

  • All options are described in the docstrings of the model files
  • The depth is controlled by the block_config option
  • efficient=True uses the memory-efficient version
  • If you want to use the model for ImageNet, set small_inputs=False. For CIFAR or SVHN, set small_inputs=True.
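
A usage sketch (argument values are illustrative, and it assumes the class is exposed as models.DenseNet as in the demo script; see the docstrings for the full list of options):

from models import DenseNet

# DenseNet-BC with depth 100 and growth rate 12 on CIFAR-sized inputs,
# using the checkpointed (memory-efficient) implementation
model = DenseNet(
    growth_rate=12,
    block_config=(16, 16, 16),  # 3 dense blocks of 16 layers -> depth 100
    num_classes=10,
    small_inputs=True,          # CIFAR / SVHN
    efficient=True,             # recompute intermediates in the backward pass
)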

Running the demo:

The only extra package you need to install is python-fire:

pip install fire
  • Single GPU:
CUDA_VISIBLE_DEVICES=0 python demo.py --efficient True --data <path_to_folder_with_cifar10> --save <path_to_save_dir>
  • Multiple GPU:
CUDA_VISIBLE_DEVICES=0,1,2 python demo.py --efficient True --data <path_to_folder_with_cifar10> --save <path_to_save_dir>

Options:

  • --depth (int) - depth of the network (number of convolution layers) (default 40)
  • --growth_rate (int) - number of features added per DenseNet layer (default 12)
  • --n_epochs (int) - number of epochs for training (default 300)
  • --batch_size (int) - size of minibatch (default 256)
  • --seed (int) - manually set the random seed (default None)

Performance

A comparison of the two implementations (each is a DenseNet-BC with 100 layers, batch size 64, tested on an NVIDIA Pascal Titan-X):

Implementation          Memory consumption (GB/GPU)   Speed (sec/mini-batch)
Naive                   2.863                         0.165
Efficient               1.605                         0.207
Efficient (multi-GPU)   0.985                         -

LuaTorch

Thanks to Gao Huang, DenseNet is implemented in LuaTorch with efficient memory utilization.

The standard (original) implementation of DenseNet with recursive concatenation is very memory-inefficient. This can be an obstacle when we need to train DenseNets on high-resolution images (such as for object detection and localization tasks) or on devices with limited memory.

In theory, DenseNet should use memory more efficiently than other networks, because one of its key features is that it encourages feature reuse. The fact that DenseNet is "memory hungry" in practice is simply an artifact of implementation. In particular, the culprit is the recursive concatenation, which re-allocates memory for all previous outputs at each layer. Consider a dense block with N layers: the first layer's output has N copies in memory, the second layer's output has (N-1) copies, and so on, leading to a quadratic increase (1+2+...+N) in memory consumption as the network depth grows.

Using optnet (-optMemory 1) or shareGradInput (-optMemory 2), we can significantly reduce the run-time memory footprint of the standard implementation (with recursive concatenation). However, the memory consumption is still a quadratic function of depth.

We implement a customized densely connected layer (largely motivated by the Caffe implementation of memory efficient DenseNet by Tongcheng), which uses shared buffers to store the concatenated outputs and gradients, thus dramatically reducing the memory footprint of DenseNet during training. The mode -optMemory 3 activates shareGradInput and shared output buffers, while the mode -optMemory 4 further shares the memory to store the output of the Batch-Normalization layer before each 1x1 convolution layer. The latter makes the memory consumption linear in network depth, but introduces a training time overhead due to the need to re-forward these Batch-Normalization layers in the backward pass.

In practice, we suggest using the default -optMemory 2, as it does not require customized layers, while the memory consumption is moderate. When GPU memory is really the bottleneck, we can adopt the customized implementation by setting -optMemory to 3 or 4, e.g.,

th main.lua -netType densenet -dataset cifar10 -batchSize 64 -nEpochs 300 -depth 100 -growthRate 12 -optMemory 4

The following time and memory footprints are benchmarked with a DenseNet-BC (L=100, k=12) on CIFAR-10, on an NVIDIA TitanX GPU:

optMemory   Memory   Time (s/mini-batch)   Description
0           5453M    0.153                 Original implementation
1           3746M    0.153                 Original implementation with optnet
2           2969M    0.152                 Original implementation with shareGradInput
3           2188M    0.155                 Customized implementation with shareGradInput and sharePrevOutput
4           1655M    0.175                 Customized implementation with shareGradInput, sharePrevOutput and shareBNOutput

Tensorflow

Thanks to Joe Yearsley, DenseNet is implemented in TensorFlow with efficient memory utilization.

Motivation

While DenseNets are fairly easy to implement in deep learning frameworks, most implementations (such as the original) tend to be memory-hungry. In particular, the number of intermediate feature maps generated by batch normalization and concatenation operations grows quadratically with network depth.

It is worth emphasizing that this is not a property inherent to DenseNets, but rather to the implementation.

This implementation uses a new strategy to reduce the memory consumption of DenseNets. It is based on efficient_densenet_pytorch and makes use of checkpointing intermediate feature maps, along with an alternate approach.

This adds 15-20% of time overhead for training, but reduces feature map consumption from quadratic to linear.

For more details, please see the technical report.

How to checkpoint

Currently all of the dense layers are checkpointed; however, you can alter the implementation to trade off speed and memory. For example, checkpointing only the earlier layers discards the intermediates that are generally the largest, since feature maps shrink after each pooling layer.

However, more strategies can be found in the alternate approach; a sketch of selective checkpointing follows.
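
As a sketch of that trade-off (maybe_recompute and checkpoint_from are made-up names; this assumes the TF 1.x contrib API used in the snippet below), you could wrap only selected layer functions:

import tensorflow as tf

def maybe_recompute(layer_fn, layer_index, checkpoint_from=0):
    """Wrap a layer-building function with gradient checkpointing only for
    layers whose index is >= checkpoint_from; un-wrapped layers keep their
    intermediate feature maps, so they are faster but more memory-hungry."""
    if layer_index >= checkpoint_from:
        return tf.contrib.layers.recompute_grad(layer_fn)
    return layer_fn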

Example setup for a 12 GB NVIDIA GPU

python train.py --batch_size 6000 --efficient True

python train.py --batch_size 3750

Main piece of code:

models/densenet_creator.py#116

        def _x(ip):
            x = batch_normalization(ip, **self.bn_kwargs)
            x = tf.nn.relu(x)

            if self.bottleneck:
                inter_channel = nb_filter * 4

                x = conv2d(x, inter_channel, (1, 1), kernel_initializer='he_normal', padding='same', use_bias=False,
                           **self.conv_kwargs)
                x = batch_normalization(x, **self.bn_kwargs)
                x = tf.nn.relu(x)

            x = conv2d(x, nb_filter, (3, 3), kernel_initializer='he_normal', padding='same', use_bias=False,
                       **self.conv_kwargs)

            if self.dropout_rate:
                x = dropout(x, self.dropout_rate, training=self.training)

            return x

        if self.efficient:
            # Gradient checkpoint the layer
            _x = tf.contrib.layers.recompute_grad(_x)

Requirements

  • Tensorflow 1.9+
  • Horovod

Usage

If you care about speed, and memory is no object, pass the efficient=False argument into the DenseNet constructor. Otherwise, pass in efficient=True.

Important Options:

  • --batch_size (int) - The number of images per batch (default 3750)
  • --fp16 (bool) - Whether to run with FP16 or not (default False)
  • --efficient (bool) - Whether to run with gradient checkpointing or not (default False)

Caffe

Thanks to Tongcheng Li, DenseNet is implemented in Caffe with efficient memory utilization.

Caffe fork src: https://github.com/Tongcheng/caffe/

Features

This is an implementation that uses O(T) space for data, where T is the number of transitions within a DenseBlock. For the simple model with total layers L = 40 and growth rate k = 12, the number of transitions in each DenseBlock is 12.

In comparison, the original implementation takes O(T^2) space for data.

This version currently runs (without dropout) at 6 iters/second and uses less than 2 GB of GPU memory for the L=40, k=12 model.

The way the linear space is implemented is:

(i) Setting the cuDNN TensorDescriptor explicitly, allowing strides between different images. (This means the data is initially laid out non-contiguously; our process fills in the blank data in the middle.)

(ii) In the backward phase, we first use BN forward and ReLU forward to reconstruct the corresponding data for a transition. Then we apply the normal backward procedure.
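
Conceptually (a NumPy sketch of the idea only, not the Caffe/cuDNN code), every layer of a dense block writes its output into its own slice of one pre-allocated buffer, so the "concatenated" input of later layers is just a view and no extra copies are made:

import numpy as np

batch, height, width = 64, 32, 32
init_channels, growth_rate, num_layers = 24, 12, 12

# one shared buffer for the whole dense block: initial features plus
# every layer's output, i.e. O(T) space instead of O(T^2)
total_channels = init_channels + num_layers * growth_rate
shared = np.zeros((batch, total_channels, height, width), dtype=np.float32)

offset = init_channels
for layer in range(num_layers):
    # the concatenation of all previous outputs is just a view into the buffer
    concat_input = shared[:, :offset]
    # stand-in for the layer's BN -> ReLU -> conv output
    new_features = np.tanh(concat_input).mean(axis=1, keepdims=True).repeat(growth_rate, axis=1)
    shared[:, offset:offset + growth_rate] = new_features
    offset += growth_rate
# in cuDNN this layout is achieved by setting tensor descriptors with explicit strides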

How to use it:

  1. clone the source
  2. mkdir build and cd into it
  3. cmake .. (if you are using ATLAS) OR cmake .. -DBLAS=open (if you are using OpenBLAS)
  4. make all