Micronet submission

Submission to the NeurIPS 2019 MicroNet challenge.

  • Architecture: Quantized ProxylessNAS (mobile14 variant)
  • Accuracy: 75.012% (37506/50000) ImageNet validation top-1
  • Cost: 0.5558137893676758

Requirements

Suggested installation

  • On a Linux server with at least one GPU available, pull the official PyTorch 1.1.0 image:
docker pull pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-devel
  • Run the image, possibly with extra shared memory if training (tested with 32 GB), mounting in ImageNet:
docker run -it -v /path/to/imagenet:/imagenet --shm-size 32G pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-devel
  • Install Git LFS:
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
apt-get install git-lfs
  • Install Brevitas:
pip install git+https://github.com/Xilinx/brevitas.git
  • Install APEX:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  • Clone the current repo:
git clone https://github.com/volcacius/micronet_competition.git
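  • Optionally, as a quick sanity check (assuming the steps above completed), verify that the main dependencies import and that a GPU is visible:
python -c "import torch, brevitas, apex; print(torch.__version__, torch.cuda.is_available())"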

Evaluate accuracy

From the micronet_competition repo, run:

python multiproc.py --nproc_per_node 1 main.py --data /imagenet --resume ./proxylessnas_4b_5b/model_best.pth.tar --evaluate

Please note that accuracy experiences some jitter depending on batch size, i.e. +/- 11 correctly classified images. The reported accuracy is the lowest that was found, and it is above the required threshold of 37500. The cause of this behaviour is most probably a bug in topk, reported in pytorch/pytorch#27542, that shows inconsistent behaviour when dealing with duplicate values. In a floating point setting duplicates almost never happen, but with quantized values they are much more common.
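
As an illustration (a minimal sketch with made-up values, not taken from the model), a tie in the top logit makes the counted top-1 correctness depend on which of the tied indices topk happens to return:

import torch

# With quantized outputs, ties in the top logit are common; torch.topk gives
# no guarantee about which of the tied indices it returns, so the top-1 count
# can shift slightly depending on backend and batch composition.
logits = torch.tensor([[0.25, 0.25, 0.10]])  # classes 0 and 1 tie (hypothetical values)
target = torch.tensor([1])
_, pred = logits.topk(1, dim=1)
correct = pred.squeeze(1).eq(target).sum().item()
print(correct)  # 0 or 1, depending on which tied index topk returns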

Evaluate cost

From the micronet_competition repo, run:

python multiproc.py --nproc_per_node 1 main.py --data /imagenet --resume ./proxylessnas_4b_5b/model_best.pth.tar --compute-micronet-cost

Target hardware

The quantized model can be implemented with fully integer hardware, without any explicit form of scaling. This type of approach builds on the work developed by Xilinx Research around dataflow style FPGA implementations of highly quantized mixed-precision models [1] [2].

In particular, implementing monotonic activations (ReLU and hardtanh in this case) with thresholding allows retaining floating point scale factors at training time, while having a fully integer pipeline at inference time. Conceptually, it works as follows (a small numerical sketch follows the list):

  • For a linear (conv, fc, etc.) layer with quantized inputs and weights, compute the bit width of the output accumulator, i.e. the size in bits of the largest possible value generated by the layer. This is done by Brevitas.
  • Given the output bit width, enumerate each possible integer output value, and generate a separate identical set per channel.
  • Aggregate all the floating point scale factors and biases applied to the integer representation of the output, and apply them per channel to each value that you previously generated. Typically this includes the scale factor of the input, the scale factors of the weights, and the scaling and shift induced by batch norm.
  • Pass each of the values you generated through the quantized activation function. Each value is now requantized to some quantization level.
  • Because the activation function is monotonic, the per-channel outputs are increasing w.r.t. the original integer inputs.
  • For each channel, look at when the output of the quantized activation goes from one quantization level to the next one. Go back through your ops and find the integer value of the input accumulator that triggers that jump. That value is the threshold for that quantization level in that channel. The thresholds then have the same bit width as the accumulator at the output of your linear layer.
  • In hardware, in order to compute your quantized activation, you can just compare each value of your integer accumulator with the set of thresholds corresponding to that channel. This is implemented in the finn-hlslib library (https://finn-hlslib.readthedocs.io/en/latest/?badge=latest).
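
As a rough illustration (a minimal numerical sketch with made-up scale, bias, and bit widths, not the Brevitas or finn-hlslib implementation), the per-channel thresholds for a quantized ReLU could be derived as follows:

import numpy as np

def relu_quant(x, act_scale, num_levels):
    # Quantized ReLU: scale, round, and clamp to [0, num_levels - 1].
    return np.clip(np.floor(x / act_scale + 0.5), 0, num_levels - 1)

def thresholds_for_channel(acc_bits, scale, bias, act_scale, num_levels):
    # Enumerate every possible signed value of the output accumulator.
    acc_values = np.arange(-2 ** (acc_bits - 1), 2 ** (acc_bits - 1))
    # Apply the aggregated floating point factors (input/weight scales, batch norm),
    # then the quantized activation; the result is non-decreasing in acc_values.
    levels = relu_quant(acc_values * scale + bias, act_scale, num_levels)
    # Threshold for level q = smallest accumulator value whose output reaches q.
    return [int(acc_values[np.argmax(levels >= q)]) for q in range(1, num_levels)]

# One channel: 16-bit accumulator, 4-bit activation output (16 levels), made-up factors.
print(thresholds_for_channel(acc_bits=16, scale=3e-4, bias=-0.1, act_scale=0.4, num_levels=16))

At inference time, the activation for that channel then reduces to counting how many of its thresholds the integer accumulator value exceeds.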

Model quantization

The approach taken to quantization is as follows:

  • Input preprocessing is simplified to a single scalar mean and variance, and left at 8 bits.
  • Weights of the first layer are quantized to 8 bits.
  • Inputs and weights of depthwise convolutions are quantized to 5 bits.
  • All other weights are quantized to 4 bits.
  • Scale factors for weights and activations are in floating point, and assumed to be merged within thresholds.
  • Weights have per-channel scale factors, with the exception of the fully connected layer, which has a per-tensor scale factor. This avoids rescaling the output of the network.
  • Activations have per-tensor scale factors, with the exception of activations that are input to depthwise convolutions, where all the scale factors are per-channel. That's because no reduction over the channel dimension happens in the convolution, so the per-channel scale factors of the input can be propagated to the thresholds after the convolution.
  • Scale factors of activations are initialized at 6.0 (given that the full precision model was trained with ReLU6) and learned from there.
  • Scale factors of weights are computed as max(abs(W)).
  • Batch norm is left unquantized and assumed to be merged into thresholds (both additive and multiplicative factors).
  • Weights are quantized with narrow_range=True, so that the sign possibly introduced by the multiplicative factors of batch norm can be merged into them. This makes sure that the output of a set of thresholds is always increasing w.r.t. the thresholds (a small sketch of this weight quantization scheme follows the list).
  • In the full precision version of the network, the last ConvBlock of each ProxylessBlock doesn't have an activation function. In the quantized version, a quantized hardtanh activation is inserted. This hardtanh is shared among the blocks connected by a residual element-wise add, so that the inputs to the element-wise add have the same scale factor. The hardtanh function is also called after the element-wise add to requantize the output.
  • Division in avg pooling is implemented by means of a truncation to 4 bits.
  • Bias of the fully connected layer cannot be merged into thresholds (as there are none at the end of the network), so it is quantized to the precision of the output accumulator, with the scale factor of the output accumulator.
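
As an illustration (a minimal sketch under the assumptions above, not the Brevitas implementation), per-channel, narrow-range symmetric weight quantization with the scale taken as max(abs(W)) could look like this:

import torch

def quantize_weights(w, bit_width=4):
    # Per output channel scale: max absolute value over the remaining dimensions.
    scale = w.abs().flatten(1).max(dim=1).values.view(-1, 1, 1, 1)
    # narrow_range=True: use [-(2^(b-1) - 1), 2^(b-1) - 1], e.g. [-7, 7] at 4 bits,
    # so negating a channel (to absorb a negative batch norm factor) stays in range.
    q_max = 2 ** (bit_width - 1) - 1
    w_int = torch.round(w / scale * q_max).clamp(-q_max, q_max)
    return w_int, scale / q_max  # integer weights and per-channel scale factors

w = torch.randn(32, 16, 3, 3)  # conv weight: out_ch, in_ch, kH, kW (hypothetical shape)
w_int, w_scale = quantize_weights(w)
print(w_int.unique().numel(), w_scale.shape)

Because the integer range is symmetric ([-7, 7] at 4 bits rather than [-8, 7]), negating all the weights of a channel to absorb a negative batch norm factor keeps them representable.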

Training

Training was performed in two steps on 3 Nvidia P6000 GPUs, starting from a pre-trained full precision model. AutoAugment policies for ImageNet are used, while CutMix and MixUp are left disabled (because the full precision model was trained without them).

  • First, finetune the pre-trained full precision model with 4 bit weights and activations (8 bit first layer, 4 bit depthwise layers):
BREVITAS_IGNORE_MISSING_KEYS=1 python multiproc.py --nproc_per_node 3 main.py --data /imagenet --workspace /path/to/4b/workspace --batch-size 96 --epochs 60 --print-freq 10 --lr 0.005 --label-smoothing 0.1 --workers 10 --weight-decay 3e-5 --quant-type INT --bit-width 4 --first-layer-bit-width 8 --weight-scaling-impl-type STATS --hard-tanh-threshold 10 --lr-schedule step --milestones 15,30,45 --resume /path/to/proxylessnas_mobile14-0662-0c0ad983.pth --finetune --depthwise-bit-width 4
  • Because the 4b model doesn't reach the required target accuracy, retrain by increasing the bit width of the depthwise layers (both inputs and weights) to 5 bits, as well as using dropout 0.2 with 32 steps of multisampling:
python multiproc.py --nproc_per_node 3 main.py --data /imagenet --workspace /path/to/4b_5b/workspace --batch-size 96 --epochs 20 --print-freq 10 --lr 0.0001 --label-smoothing 0.1 --workers 10 --weight-decay 3e-5 --quant-type INT --bit-width 4 --bn-no-wd --first-layer-bit-width 8 --weight-scaling-impl-type STATS --hard-tanh-threshold 10 --lr-schedule step --milestones 5,10,15 --resume /path/to/4b/workspace/model_best.pth.tar --finetune --depthwise-bit-width 5 --dropout-rate 0.2 --dropout-steps 32

Credits

Implementation based on the following repositories. Respective licenses can be found under LICENSES.

Author

Alessandro Pappalardo @ Xilinx Research Labs.

References

[1] Umuroglu, Yaman, et al. "FINN: A Framework for Fast, Scalable Binarized Neural Network Inference." Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017.

[2] Blott, Michaela, et al. "FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks." ACM Transactions on Reconfigurable Technology and Systems (TRETS) 11.3 (2018): 16.