For MicroNet Challenge PLEASE 👉CHECK HERE👈
This repository contains an MXNet (Gluon) implementation of Single Path One-Shot NAS networks, modified from the official PyTorch implementation. It provides an open-source weight-sharing Neural Architecture Search (NAS) pipeline that can finish training and searching on ImageNet within 60 GPU hours in total (on 4 V100 GPUs, including supernet training, supernet searching and training of the best searched subnet) over a search space of about 32^20 choices.
This implementation differs from the official version in several ways. For training, it supports block and channel selection for the supernet model, ShuffleNetV2+ style SE for the supernet and subnet, and the MobileNet V3 style last convolution block. For searching, it supports both genetic and random search with BN statistics update and FLOPs / number-of-parameters constraints. For evaluation and deployment, it provides tools for FLOPs and parameter calculation, per-operator profiling, int8 quantization and Batch Norm merging.
With this implementation, we found a new state-of-the-art NAS searched model that outperforms other NAS models such as Single Path One Shot, FBNet, MnasNet, DARTS, NASNET and PNASNET by a good margin in FLOPs, number of parameters and Top-1 / Top-5 accuracy. In terms of the MicroNet Challenge Σ Normalized Scores, before any quantization, it also outperforms other popular base models such as MobileNet V2, V3 and ShuffleNet V1, V2, V2+.
10/09/2019 Update:
A searched model Oneshot-S+, with the block choices and channel choices searched by this repo's implementation, ShuffleNetV2+ style SE and MobileNetV3 last convolution block design, reaches the new highest top-1 & top-5 accuracies with the new lowest Google MicroNet Challenge Σ Normalized Scores. Check here for comparison.
09/30/2019 Update:
A customized model Oneshot+, with the block choices and channel choices provided from paper, ShuffleNetV2+ style SE and MobileNetV3 last convolution block design, reaches the highest top-1 & top-5 accuracies with the lowest Google MicroNet Challenge Σ Normalized Scores. Check here for comparison.
Model | FLOPs | # of Params | Top - 1 | Top - 5 | Σ Normalized Scores | Scripts | Logs |
---|---|---|---|---|---|---|---|
OneShot+ Supernet | 841.9M | 15.4M | 62.90 | 84.49 | 7.09 | script | log |
OneShot-S+ | 291M | 3.5M | 75.75 | 92.77 | 1.9166 | script | log |
OneShot+ | 297M | 3.7M | 75.24 | 92.58 | 1.9937 | script | log |
OneShot (our) | 328M | 3.4M | 74.02 | 91.60 | 2 | script | log |
OneShot (official) | 328M | 3.4M | 74.9 | 92.0 | 2 | - | - |
FBNet-B | 295M | 4.5M | 74.1 | - | 2.19 | - | - |
MnasNet | 317M | 4.2M | 74.0 | 91.8 | 2.20 | - | - |
MobileNetV3 Large | 217M | 5.4M | 75.2 | - | 2.25 | - | - |
DARTS | 574M | 4.7M | 73.3 | 91.3 | 3.13 | - | - |
NASNET-A | 564M | 5.3M | 74.0 | 91.6 | 3.28 | - | - |
PNASNET | 588M | 5.1M | 74.2 | 91.9 | 3.29 | - | - |
MobileNetV2 (1.4) | 585M | 6.9M | 74.7 | - | 3.81 | - | - |
Download the ImageNet dataset, reorganize the raw data and create MXNet RecordIO files (or just put the validation images in their corresponding class folders) by following this script.
Set up the environments.
python3 -m pip install --user --upgrade pip
python3 -m pip install --user virtualenv
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
Train & search
# Train supernet
sh ./scripts/train_supernet.sh
# Search supernet
sh ./scripts/search_supernet.sh
# Train best searched model
sh ./scripts/train_fixArch+.sh
Our approach is mainly based on Single Path One Shot NAS, combined with Squeeze and Excitation (SE), ShuffleNet V2+ and MobileNet V3. Like the original paper, we searched for the choice blocks and block channels under multiple FLOPs and parameter count constraints. In this section, we elaborate on the modifications from the original paper.
For each ShuffleNasBlock, four choice blocks were explored: ShuffleNetBlock-3x3 (SNB-3), SNB-5, SNB-7 and ShuffleXceptionBlock-3x3 (SXB-3). Within each block, eight channel choices are available: [0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0] * (BlockOutputChannel / 2). So each ShuffleNasBlock explores 32 possible choices, and there are 20 blocks in this implementation, giving 32^20 design choices in total.
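The size of the search space follows directly from the numbers above. A minimal sketch of that arithmetic, where the constant and function names are illustrative and the rounding of scaled channel counts to integers is an assumption (the source does not say how they are rounded):

```python
# Illustrative names only; they are not taken from the repository.
BLOCK_CHOICES = ["SNB-3", "SNB-5", "SNB-7", "SXB-3"]
CHANNEL_SCALES = [0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
NUM_BLOCKS = 20


def mid_channels(block_output_channel, scale):
    """Channel count for one choice: scale * (BlockOutputChannel / 2).

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return int(round(scale * block_output_channel / 2))


# 4 block types x 8 channel scales = 32 choices per ShuffleNasBlock.
choices_per_block = len(BLOCK_CHOICES) * len(CHANNEL_SCALES)
# 20 independent blocks give 32^20 architectures.
search_space_size = choices_per_block ** NUM_BLOCKS
```

Since 32 = 2^5, the search space contains 2^100 candidate architectures.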
We also applied SE, the ShuffleNet V2+ SE layout and the MobileNet V3 last convolution block design in the supernet. In the end, the supernet contains 15.4 million trainable parameters, and the possible subnet FLOPs range from 168M to 841M.
Unlike the original paper, we did not sample channels from a uniform distribution from the beginning of training. We train the supernet for 120 epochs in total. The first 60 epochs do Block Selection only; for the following 60 epochs, we use Channel Selection Warm-up, which gradually allows the supernet to be trained with a larger range of channel choices.
# Supernet sampling schedule: during channel selection warm-up
1 - 60 epochs: Only block selection (BS), Channels are set to maximum (here [2.0])
61 epoch: [1.8, 2.0] + BS
62 epoch: [1.6, 1.8, 2.0] + BS
63 epoch: [1.4, 1.6, 1.8, 2.0] + BS
64 epoch: [1.2, 1.4, 1.6, 1.8, 2.0] + BS
65 - 66 epochs: [1.0, 1.2, 1.4, 1.6, 1.8, 2.0] + BS
67 - 69 epochs: [0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0] + BS
70 - 73 epochs: [0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0] + BS
We did this because our experiments showed that, for a supernet without SE, doing Block Selection from the beginning works well, whereas doing Channel Selection from the beginning prevents the network from converging at all. The Channel Selection range needs to be enlarged gradually; otherwise training crashes with a free-fall drop in accuracy. Even then, the range can only be (0.6 ~ 2.0); smaller channel scales also make the network crash. For a supernet with SE, Channel Selection with the full choices (0.2 ~ 2.0) can be used from the beginning and it converges. However, this seems to harm accuracy: compared to the same SE supernet trained with Channel Selection warm-up, the model trained with Channel Selection from scratch stayed about 10% behind in training accuracy throughout the whole procedure.
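The warm-up schedule listed above can be expressed as a small lookup. This is a sketch rather than the repository's code: the function name is hypothetical, epochs are 1-indexed as in the listing, and keeping the full range after the last listed epoch is an assumption.

```python
CHANNEL_SCALES = [0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]

# (first epoch of the stage, number of largest scales available),
# checked from the latest stage to the earliest.
WARMUP_STAGES = [(70, 8), (67, 7), (65, 6), (64, 5), (63, 4), (62, 3), (61, 2)]


def channel_choices_for_epoch(epoch):
    """Return the channel scales that may be sampled in a 1-indexed epoch."""
    for first_epoch, num_scales in WARMUP_STAGES:
        if epoch >= first_epoch:
            return CHANNEL_SCALES[-num_scales:]
    # Epochs 1-60: block selection only, channels fixed at the maximum scale.
    return CHANNEL_SCALES[-1:]
```

For example, `channel_choices_for_epoch(62)` yields the three largest scales, matching the listing.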
Different from the paper, we jointly searched for the Block choices and Channel choices in the supernet at the same time. Each individual in the population of our genetic algorithm therefore contains 20 Block choice genes and 20 Channel choice genes. We aimed to find a combination of the two that are optimized for each other and complementary.
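One way to picture such an individual is as two gene lists of length 20. The sketch below uses generic uniform crossover and per-gene mutation; the source does not specify these operators or the mutation probability, so they are assumptions, as are all names.

```python
import random

NUM_BLOCKS = 20           # ShuffleNasBlocks in the supernet
NUM_BLOCK_CHOICES = 4     # SNB-3, SNB-5, SNB-7, SXB-3
NUM_CHANNEL_CHOICES = 8   # channel scales 0.6 ... 2.0
GENE_LIMITS = {"block": NUM_BLOCK_CHOICES, "channel": NUM_CHANNEL_CHOICES}


def random_candidate(rng):
    """Sample one individual: 20 block genes plus 20 channel genes."""
    return {key: [rng.randrange(limit) for _ in range(NUM_BLOCKS)]
            for key, limit in GENE_LIMITS.items()}


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each gene is copied from one of the two parents."""
    return {key: [rng.choice(pair)
                  for pair in zip(parent_a[key], parent_b[key])]
            for key in GENE_LIMITS}


def mutate(candidate, rng, prob=0.1):
    """Resample each gene independently with probability `prob` (assumed)."""
    return {key: [rng.randrange(GENE_LIMITS[key]) if rng.random() < prob
                  else gene
                  for gene in candidate[key]]
            for key in GENE_LIMITS}
```

Because block and channel genes live in the same individual, selection pressure acts on the pair rather than on each choice in isolation.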
For each qualified subnet structure (one with lower Σ Normalized Scores than the baseline OneShot searched model), as in most weight-sharing NAS approaches, we first updated the BN statistics with 20,000 fixed training set images and then evaluated the subnet's ImageNet validation accuracy as the indicator of its performance.
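The idea behind the BN statistics update is that the mean and variance inherited from the supernet do not match any single subnet, so they are recomputed on a fixed calibration set before validation. The plain-Python sketch below is a stand-in for that step, not the repository's customized BN layer; the function name and the list-based input layout are assumptions.

```python
def recalibrate_bn(batches):
    """Recompute per-channel mean and (biased) variance from scratch.

    `batches` is an iterable of batches; each batch is a list of samples and
    each sample is a list with one activation value per channel.
    """
    count = 0
    total = None
    total_sq = None
    for batch in batches:
        for sample in batch:
            if total is None:
                total = [0.0] * len(sample)
                total_sq = [0.0] * len(sample)
            for channel, value in enumerate(sample):
                total[channel] += value
                total_sq[channel] += value * value
            count += 1
    if count == 0:
        raise ValueError("at least one calibration sample is required")
    mean = [t / count for t in total]
    # Var[x] = E[x^2] - E[x]^2, accumulated over the whole calibration set.
    var = [sq / count - m * m for sq, m in zip(total_sq, mean)]
    return mean, var
```

In the actual pipeline the same calibration images are reused for every candidate, so accuracy differences come from the architecture rather than from the sampled statistics.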
For the final searched model, we build and train it from scratch. No previous supernet weights are reused in the subnet.
As for the hyperparameters, we modified the official GluonCV ImageNet training script to support both supernet training and subnet training. We trained both models with an initial learning rate of 1.3, weight decay of 0.00003, a cosine learning rate scheduler, 4 GPUs each with batch size 256, label smoothing and no weight decay for BN beta and gamma. The supernet was trained for 120 epochs and the subnet for 360 epochs.
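For reference, the shape of a cosine schedule with these settings can be sketched as follows. This assumes per-epoch granularity, no warm-up phase and a final learning rate of 0, none of which is stated in the text; the function name is illustrative.

```python
import math

BASE_LR = 1.3
NUM_GPUS = 4
BATCH_SIZE_PER_GPU = 256
GLOBAL_BATCH_SIZE = NUM_GPUS * BATCH_SIZE_PER_GPU  # 1024 images per step


def cosine_lr(epoch, total_epochs, base_lr=BASE_LR):
    """Cosine decay from base_lr at epoch 0 to 0 at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

With `total_epochs=120` for the supernet, the learning rate is halved to 0.65 at epoch 60; the subnet follows the same curve stretched over 360 epochs.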
Model | FLOPs | # of Params | Top - 1 | Top - 5 | Σ Normalized Scores | Scripts | Logs |
---|---|---|---|---|---|---|---|
OneShot+ Supernet | 1684M | 15.4M | 62.9 | 84.5 | 3.67 | script | log |
We tried both random search, randomly selecting 250 qualified instances to evaluate their performance, and genetic search. The genetic method easily found a better subnet structure than random selection.
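The random search baseline described above amounts to rejection sampling under the score constraint. A minimal sketch, in which `sample_fn`, `score_fn` and `accuracy_fn` are hypothetical callables supplied by the caller and `max_trials` is an added safety limit:

```python
def random_search(sample_fn, score_fn, accuracy_fn,
                  num_qualified=250, baseline_score=2.0, max_trials=100000):
    """Evaluate `num_qualified` random candidates that beat the baseline score.

    Returns (best_candidate, best_accuracy, number_of_qualified_candidates).
    """
    best, best_acc = None, float("-inf")
    qualified = 0
    for _ in range(max_trials):
        if qualified >= num_qualified:
            break
        candidate = sample_fn()
        # Reject candidates that are not cheaper than the baseline model.
        if score_fn(candidate) >= baseline_score:
            continue
        qualified += 1
        accuracy = accuracy_fn(candidate)
        if accuracy > best_acc:
            best, best_acc = candidate, accuracy
    return best, best_acc, qualified
```

Here `accuracy_fn` would stand for the BN update plus validation step, which is the expensive part; the constraint check is cheap and filters candidates before any evaluation.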
Model | FLOPs | # of Params | Top - 1 | Top - 5 | Σ Normalized Scores | Scripts | Logs |
---|---|---|---|---|---|---|---|
OneShot+ Supernet | 841.9M | 15.4M | 62.90 | 84.49 | 7.09 | script | log |
OneShot-S+ | 291M | 3.5M | 75.75 | 92.77 | 1.9166 | script | log |
OneShot-S+ noBN | 291M | 3.5M | 75.6 | 92.8 | 1.9166 | script | log |
OneShot+ | 297M | 3.7M | 75.24 | 92.58 | 1.9937 | script | log |
OneShot (our) | 328M | 3.4M | 74.02 | 91.60 | 2 | script | log |
OneShot (official) | 328M | 3.4M | 74.9 | 92.0 | 2 | - | - |
FBNet-B | 295M | 4.5M | 74.1 | - | 2.19 | - | - |
MnasNet | 317M | 4.2M | 74.0 | 91.8 | 2.20 | - | - |
MobileNetV3 Large | 217M | 5.4M | 75.2 | - | 2.25 | - | - |
DARTS | 574M | 4.7M | 73.3 | 91.3 | 3.13 | - | - |
NASNET-A | 564M | 5.3M | 74.0 | 91.6 | 3.28 | - | - |
PNASNET | 588M | 5.1M | 74.2 | 91.9 | 3.29 | - | - |
MobileNetV2 (1.4) | 585M | 6.9M | 74.7 | - | 3.81 | - | - |
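The table does not spell out how the Σ Normalized Scores column is computed. The reported values for almost all rows are consistent with normalizing FLOPs and parameters against the OneShot baseline row (328M FLOPs, 3.4M parameters) and summing the two ratios. The sketch below encodes that inferred formula; treat it as an assumption rather than the official definition.

```python
# Baseline values taken from the OneShot row of the table above.
BASELINE_FLOPS_M = 328.0
BASELINE_PARAMS_M = 3.4


def sigma_normalized_score(flops_m, params_m):
    """Sum of FLOPs and parameter counts, each normalized to the baseline."""
    return flops_m / BASELINE_FLOPS_M + params_m / BASELINE_PARAMS_M
```

Under this formula the baseline scores exactly 2, and a candidate counts as qualified when its score is below 2.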
A detailed op-by-op profile can be found here. The calculation follows the MicroNet Challenge counting rules, which differ slightly from how most papers report FLOPs.
op_name | quantizable | inp_size | kernel_size | Cin | Cout | params(M) | mults(M) | adds(M) | MFLOPS |
---|---|---|---|---|---|---|---|---|---|
First conv | True | 224 | 3 | 3 | 16 | 0.000 | 5.419 | 5.218 | 10.637 |
HSwish | False | 112 | -1 | 16 | 16 | 0.000 | 0.603 | 0.201 | 0.804 |
SNB-3x3 | Mixed | 112 | 3 | 16 | 64 | 0.005 | 23.800 | 21.739 | 45.539 |
SNB-3x3 | Mixed | 56 | 3 | 64 | 64 | 0.004 | 12.136 | 11.255 | 23.391 |
SNB-3x3 | Mixed | 56 | 3 | 64 | 64 | 0.002 | 10.511 | 9.697 | 20.208 |
SNB-5x5 | Mixed | 56 | 5 | 64 | 64 | 0.005 | 16.389 | 15.451 | 31.840 |
SNB-3x3 | Mixed | 56 | 3 | 64 | 160 | 0.021 | 32.111 | 30.707 | 62.818 |
SNB-3x3 | Mixed | 28 | 3 | 160 | 160 | 0.023 | 17.573 | 16.859 | 34.432 |
SNB-5x5 | Mixed | 28 | 5 | 160 | 160 | 0.014 | 9.746 | 9.232 | 18.978 |
SNB-3x3 | Mixed | 28 | 3 | 160 | 160 | 0.015 | 11.103 | 10.538 | 21.641 |
SXB-3x3 | Mixed | 28 | 3 | 160 | 320 | 0.082 | 14.060 | 13.692 | 27.752 |
SNB-7x7 | Mixed | 14 | 7 | 320 | 320 | 0.080 | 11.834 | 11.582 | 23.416 |
SNB-3x3 | Mixed | 14 | 3 | 320 | 320 | 0.051 | 6.416 | 6.215 | 12.631 |
SNB-5x5 | Mixed | 14 | 5 | 320 | 320 | 0.063 | 8.898 | 8.673 | 17.571 |
SNB-7x7 | Mixed | 14 | 7 | 320 | 320 | 0.080 | 11.834 | 11.582 | 23.416 |
SNB-7x7 | Mixed | 14 | 7 | 320 | 320 | 0.091 | 14.168 | 13.891 | 28.059 |
SNB-5x5 | Mixed | 14 | 5 | 320 | 320 | 0.098 | 15.448 | 15.146 | 30.594 |
SNB-7x7 | Mixed | 14 | 7 | 320 | 320 | 0.103 | 16.501 | 16.199 | 32.700 |
SNB-3x3 | Mixed | 14 | 3 | 320 | 320 | 0.323 | 25.640 | 25.380 | 51.020 |
SNB-3x3 | Mixed | 7 | 3 | 640 | 640 | 0.244 | 8.311 | 8.196 | 16.507 |
SNB-7x7 | Mixed | 7 | 7 | 640 | 640 | 0.298 | 10.983 | 10.856 | 21.839 |
SNB-3x3 | Mixed | 7 | 3 | 640 | 640 | 0.368 | 14.445 | 14.293 | 28.738 |
GAP | False | 7 | -1 | 640 | 640 | 0.000 | 0.001 | 0.031 | 0.032 |
Last conv | True | 1 | 1 | 640 | 1024 | 0.656 | 0.655 | 0.655 | 1.310 |
HSwish | False | 1 | -1 | 1024 | 1024 | 0.000 | 0.003 | 0.001 | 0.004 |
Classifier | True | 1 | 1 | 1024 | 1000 | 1.025 | 1.024 | 1.024 | 2.048 |
total_quant | True | - | - | - | - | 3.520 | 292.820 | 286.218 | 579.038 |
total_no_quant | False | - | - | - | - | 0.132 | 6.801 | 2.105 | 8.905 |
total | False | - | - | - | - | 3.652 | 299.621 | 288.323 | 587.943 |
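The plain convolution rows of the table can be reproduced by counting multiplies and adds separately. The sketch below is inferred from the First conv and Classifier rows, not taken from the profiling tool; the function name is hypothetical, and the output size of 112 for the first convolution assumes stride 2 on a 224 input.

```python
def conv_ops(out_size, kernel_size, c_in, c_out, bias=False):
    """Return (params, mults, adds) for a dense square convolution.

    Each output value needs k*k*c_in multiplies and k*k*c_in - 1 adds to
    accumulate them, plus one more add when a bias term is present.
    """
    macs_per_output = kernel_size * kernel_size * c_in
    outputs = out_size * out_size * c_out
    mults = outputs * macs_per_output
    adds = outputs * (macs_per_output - 1 + (1 if bias else 0))
    params = c_out * macs_per_output + (c_out if bias else 0)
    return params, mults, adds


# First conv: 224 -> 112 output, 3x3 kernel, 3 -> 16 channels, no bias.
first_conv = conv_ops(112, 3, 3, 16)
# Classifier: 1x1 on a 1x1 feature map, 1024 -> 1000 channels, with bias.
classifier = conv_ops(1, 1, 1024, 1000, bias=True)
```

Dividing by 10^6 gives 5.419M multiplies and 5.218M adds for the first convolution, and 1.025M parameters for the classifier, in line with the table. The MFLOPS column is the sum of multiplies and adds, which is why it is roughly twice the multiply-accumulate count most papers report.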
- Implement the fixed architecture model from the official pytorch release.
- Implement the random block selection and channel selection.
- Verify that conv kernel gradients are updated according to ChannelSelector
- Make the fixed architecture model hybridizable.
- Train a tiny model on Imagenet to verify the feasibility.
- Modify the open source MXNet FLOP calculator to support BN
- Verify that this repo's implementation shares the same # parameters and # FLOPs with the official one.
- Add SE and hard swish in the model (on/off can be controlled by --use-se)
- Add MobileNetV3 style last conv (on/off can be controlled by --last-conv-after-pooling)
- Train the official fixed architecture model on Imagenet
- Train the official uniform selection supernet model on Imagenet
- Add --use-all-blocks, --use-all-channels and --epoch-start-cs options for the supernet training.
- Add channel selection warm up: after epoch_start_cs, the channel selection range will be gradually increased.
- Train the supernet with --use-se and --last-conv-after-pooling --cs-warm-up
- Build the evolution algorithm to search within the pretrained supernet model.
- Build random search
- update BN before calculating the validation accuracy for each choice
- Build and do unit test on the customized BN for updating moving mean & variance during inference
- Replace nn.batchnorm with the customized BN
- Evolution algorithm
- Evolution algorithm with flop and # parameters constraint(s)
- Quantization
- To eliminate the possibility that BN may cause quantization problem, add merge BN tool
- To eliminate the possibility that reshape may cause quantization problem, add ShuffleChannelByConv option
- Follow up on this issue
- Search a model having both less FLOPs and # of parameters than MobileNet V3
- Add a searching mode that can specify hard FLOP and # of parameter constraints, not just the Σ scores.
- Search within the OneShot supernet with provided stage channels, SE and MobileNet V3 style conv
- This supernet setting cannot (quickly) find enough qualified candidates for the population
- In progress: Train ShuffleNetV2+ channels layout supernet with SE and MobileNet V3 style last convolution block.
- Train the best searched subnet model
- Two stage searching
- Do Block search firstly
- Based on the best searched blocks, do channel search
- Debug why training accuracy catastrophically drops after several epochs of Channel Selection
- Save parameters' values and gradients in MXBoard to visualize
- Train a 'OneShot' channels layout supernet like before with channel selection enabled after 60 epochs
- Search in this supernet and compare with previous Block Selection alone supernet searching performance
- Train a 'OneShot' channels layout supernet like before with channel selection enabled from the beginning
- Search in this supernet and compare with the BS alone one, BS + 60 CS one and this one's performance
- Estimate each (block, # channel) combination cpu & gpu latency
- Build a tool to generate repeating blocks
- Estimate speeds for 4 choice blocks with different input/mid/output channels
- Train with constraint --> To limit training of useless subnets
- Maintain a candidate pool which always contains enough (> 10) qualified candidates in background
- Only the candidates from the pool will be trained.
In this work, we provided a state-of-the-art open-source weight-sharing Neural Architecture Search (NAS) pipeline, which can be trained and searched on ImageNet within 60 GPU hours in total (on 4 V100 GPUs) over a search space of about 32^20. The model searched by this implementation outperforms other NAS searched models, such as Single Path One Shot, FBNet, MnasNet, DARTS, NASNET and PNASNET, by a good margin in FLOPs, number of parameters and Top-1 accuracy. In terms of the MicroNet Challenge Σ score, without any quantization, it also outperforms MobileNet V2, V3 and ShuffleNet V1, V2, V2+.
We have not tried more aggressive weight / channel pruning or more complex low-bit quantization methods, because most compression methods and low-bit quantized models require custom hardware to take full advantage of them. In most practical situations, however, we need to design a model that meets existing hardware constraints rather than build the hardware architecture around the algorithm. We believe that this direction - designing a better search space and searching for further optimized network structures - is suitable for direct application.
If you use these models in your research, please cite:
@article{guo2019single,
title={Single path one-shot neural architecture search with uniform sampling},
author={Guo, Zichao and Zhang, Xiangyu and Mu, Haoyuan and Heng, Wen and Liu, Zechun and Wei, Yichen and Sun, Jian},
journal={arXiv preprint arXiv:1904.00420},
year={2019}
}