/mcunet

Primary LanguagePythonApache License 2.0Apache-2.0

MCUNet: Tiny Deep Learning on IoT Devices [Website] [arXiv][Demo]

@article{lin2020mcunet,
  title={Mcunet: Tiny deep learning on iot devices},
  author={Lin, Ji and Chen, Wei-Ming and Lin, Yujun and Gan, Chuang and Han, Song},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

[News] This repo is still under development. Please check back for the updates!

teaser

[MIT News] System brings deep learning to “internet of things” devices

Overview

Microcontrollers are low-cost, low-power hardware. They are widely deployed and have wide applications.

teaser

But the tight memory budget (50,000x smaller than GPUs) makes deep learning deployment difficult.

teaser

MCUNet is a system-algorithm co-design framework for tiny deep learning on microcontrollers. It consists of TinyNAS and TinyEngine. They are co-designed to fit the tight memory budgets.

With system-algorithm co-design, we can significantly improve the deep learning performance on the same tiny memory budget.

teaser

Our TinyEngine inference engine could be a useful infrastructure for MCU-based AI applications. It significantly improves the inference speed and reduces the memory usage compared to existing libraries like TF-Lite Micro, CMSIS-NN, MicroTVM, etc.

  • TinyNAS improves the inference speed by 1.5-3x

teaser

  • TinyNAS reduces the peak memory by 2.7-4.8x

teaser

Framework Structure

The structure of the MCUNet implementation is as follows. The whole project is mostly based on a Python implementation, except for the device side deployment. TinyEngine and TinyNAS share the code of Code Generation, which is used for memory profiling during architecture search, and also the first step for device deployment.

diagram

Model Zoo

We provide the searched models for ImageNet, Visual Wake Words (VWW), and Google Speech Commands for comparison. The statistics on TF-Lite Micro and TinyEngine is based on int8 quantized version of the model (unless otherwise stated).

  • Inference engines: we compared both TF-Lite Micro and TinyEngine for the on-device statistics.
  • Model formats: we provide the model in both PyTorch floating-point format (with json definition of architecture and checkpoint (ckpt) for weights) and also TF-Lite int8 quantized format (tflite)
  • Statistics: we include
    • model statistics: computation (MACs), #parameters (Param), theoretial peak activation size (Act) (by summing input and output activation)
    • deployment statistics: peak SRAM usage (SRAM), Flash usage (Flash)

We provide download links in the following tables. Alternatively, you can download all the checkpoints and tf-lite files at once by running:

python jobs/download_all_models.py

ImageNet models

Comparison on STM32F746 (constraints: 320kB SRAM, 1MB Flash)

We first compare the baseline networks (scaled MobileNetV2, MbV2-s and scaled ProxylessNASMobile, Proxyless-s) with MCUNet on STM32F746. The baseline networks are scaled to fit the hardware constraints (we compoundly scale the width and resolution of the baseline networks and report the best accuracy under the constraints).

The latency and memory usage are measured on STM32F746 board. TinyEngine can reduce the Flash and SRAM usage and fit a larger model; TinyNAS can design network that has superior accuracy under the same memory budget.

Model Model Stats. TF-Lite Stats. TinyEngine Stats. Top-1 Acc. Top-5 Acc. Link
MbV2-s
for TF-Lite
(w0.3-r80)
MACs: 7.3M
Param: 0.63M
Act: 94kB
SRAM: 306kB
Flash: 943kB
Latency: 445ms
SRAM: 98kB
Flash: 745kB
Latency: 124ms
FP: 39.3%
int8: 38.6%
FP: 64.8%
int8: 64.1%
json
ckpt
tflite
MbV2-s
for TinyEngine
(w0.35-r144)
MACs: 23.5M
Param: 0.75M
Act: 303kB
SRAM: 646kB
Flash: 1060kB
Latency: OOM
SRAM: 308kB
Flash: 862kB
Latency: 393ms
FP: 49.7%
int8: 49.0%
FP: 74.6%
int8: 73.8%
json
ckpt
tflite
Proxyless-s
for TF-Lite
(w0.25-r112)
MACs: 10.7M
Param: 0.57M
Act: 98kB
SRAM: 288kB
Flash: 860kB
Latency: 660ms
SRAM: 114kB
Flash: 701kB
Latency: 169ms
FP: 44.9%
int8: 43.8%
FP: 70.0%
int8: 69.0%
json
ckpt
tflite
Proxyless-s
for TinyEngine
(w0.3-r176)
MACs: 38.3M
Param: 0.75M
Act: 242kB
SRAM: 526kB
Flash: 1063kB
Latency: OOM
SRAM: 292kB
Flash: 892kB
Latency: 572ms
FP: 57.0%
int8: 56.2%
FP: 80.2%
int8: 79.7%
json
ckpt
tflite
TinyNAS
for TinyEngine
(MCUNet)
MACs: 81.8M
Param: 0.74M
Act: 333kB
SRAM: 560kB
Flash: 1088kB
Latency: OOM
SRAM: 293kB
Flash: 897kB
Latency: 1075ms
FP: 62.2%
int8: 61.8%
FP: 84.5%
int8: 84.2%
json
ckpt
tflite

MCUNet under different memory constraints

We provide the MCUNet models under different memory constraints:

  • STM32 F412 (Cortex-M4, 256kB SRAM/1MB Flash)
  • STM32 F746 (Cortex-M7, 320kB SRAM/1MB Flash)
  • STM32 H743 (Cortex-M7, 512kB SRAM/2MB Flash)

The memory usage and latency are measured on the corresponding devices.

1. Int8 Models

Int8 quantization is the most widely used quantization and default setting in our experiments.

Constraints Model Stats. TF-Lite Stats. TinyEngine Stats. Top-1 Acc. Top-5 Acc. Link
256kB
1MB
MACs: 67.3M
Param: 0.73M
Act: 325kB
SRAM: 546kB
Flash: 1081kB
Latency: OOM
SRAM: 242kB
Flash: 878kB
Latency: 3535ms
FP: 60.9%
int8: 60.3%
FP: 83.3%
int8: 82.6%
json
ckpt
tflite
320kB
1MB
MACs: 81.8M
Param: 0.74M
Act: 333kB
SRAM: 560kB
Flash: 1088kB
Latency: OOM
SRAM: 293kB
Flash: 897kB
Latency: 1075ms
FP: 62.2%
int8: 61.8%
FP: 84.5%
int8: 84.2%
json
ckpt
tflite
512kB
2MB
MACs: 125.9M
Param: 1.7M
Act: 413kB
SRAM: 863kB
Flash: 2133kB
Latency: OOM
SRAM: 456kB
Flash: 1876kB
Latency: 617ms
FP: 68.4%
int8: 68.0%
FP: 88.4%
int8: 88.1%
json
ckpt
tflite
2. Int4 models

We can further reduce the memory usage with lower precision (int4). Notice that with int4 quantization, there will be a large accuracy drop compared to the float point models. Therefore, we have to perform quantization-aware training, and only report the quantized accuracy.

Note: Int4 quantization also does NOT bring further speed gain due to the instruction set. It may not be a better trade-off compared to int8 if inference latency is considered.

TODO: we are still organizing the code and the right model format for int4 model evaluation.

Constraints Model Stats. TinyEngine Stats. Top-1 Acc. Link
256kB
1MB
MACs: 134.5M
Param: 1.4M
Act: 244kB
SRAM: 233kB
Flash: 1008kB
int4: 62.0% TODO
320kB
1MB
MACs: 170.0M
Param: 1.4M
Act: 295kB
SRAM: 282kB
Flash: 1010kB
int4: 63.5% TODO
512kB
1MB
MACs: 466.8M
Param: 3.3M
Act: 495kB
SRAM: 498kB
Flash: 1986kB
int4: 70.7% TODO

MCUNet under different latency constraints

We provide the MCUNet models under different latency constraints. The models are designed to fit STM32F746 with 320kB SRAM and 1MB Flash.

Latency Limit Model Stats. TF-Lite Stats. TinyEngine Stats. Top-1 Acc. Top-5 Acc. Link
N/A MACs: 81.8M
Param: 0.74M
Act: 333kB
SRAM: 560kB
Flash: 1088kB
Latency: OOM
SRAM: 293kB
Flash: 897kB
Latency: 1075ms
FP: 62.2%
int8: 61.8%
FP: 84.5%
int8: 84.2%
json
ckpt
tflite
5FPS MACs: 12.8M
Param: 0.6M
Act: 90kB
SRAM: TODOkB
Flash: TODOkB
Latency: TODO
SRAM: 307kB
Flash: 992kB
Latency: 197ms
FP: 51.5%
int8: 49.9%
FP: 75.5%
int8: 74.1%
json
ckpt
tflite
10FPS MACs: 6.4M
Param: 0.7M
Act: 45kB
SRAM: TODOkB
Flash: TODOkB
Latency: OOM
SRAM: 266kB
Flash: 889kB
Latency: 92ms
FP: 41.5%
int8: 40.4%
FP: 66.3%
int8: 65.2%
json
ckpt
tflite

VWW & Speech Command models

Will be released soon.

Testing

We provide the script to test the accuracy of models, both the float-point models in Pytorch and int8 models in TF-Lite format.

To evaluate the accuracy of PyTorch models, run:

horovodrun -np 8 \
	python jobs/run_imagenet.py --evaluate --batch_size 50 \
	--train-dir PATH/TO/IMAGENET/train --val-dir PATH/TO/IMAGENET/val \
	--net_config assets/configs/mcunet-320kb-1mb_imagenet.json \
	--load_from assets/pt_ckpt/mcunet-320kb-1mb_imagenet.pth

To evaluate the accuracy of int8 TF-Lite models, run:

python jobs/eval_tflite.py \
	--val-dir PATH/TO/IMAGENET/val \
	--tflite_path assets/tflite/mcunet-320kb-1mb_imagenet.tflite

Model Profiling

Development and deployment on microcontrollers can be difficult and slow. However, during neural architecture search, we will need to frequently get the hardware statistics of a random neural architecture (e.g., peak SRAM, Flash usage, latency, etc.). To accelerate the process and promote the improvement of deep learning on microcontrollers, we develop profilers to get the model statistics without the need to run on the hardware.

We provide the script to profile the following statistics:

  • Model statistics: #MACs, #Param
  • Memory statistics: Peak SRAM, Flash usage (on both TF-Lite Micro and TinyEngine)
  • Latency statistics: on STM32F746 board (TODO: will be updated soon)

To evaluate the model statistics and deployment statitics from json config, run:

python jobs/profile_model.py assets/configs/mcunet-320kb-1mb_imagenet.json

You can also refer to the APIs to integrate the model profiler in your search pipeline.

Note that the profiled SRAM, Flash, and latency are estimated results. However, we find that the estimation is precise enough for neural architecture search. We will provide a detailed analysis on the estimation precision.

Code Cleaning Plan

Due to the time limit, we cannot finish all the code cleaning for now (all the features have been implemented by the submission, but not all in a good-looking way :) ). The repo is still under development. Please check back for the updates.

Currently, our code cleaning plan is:

  • Latency profiler; comparison between estimated and measured memory/latency.
  • VWW & Speech Commands models.
  • Int4 model format & evaluation.
  • TinyNAS search & training.
  • TinyEngine on-device deployment.

Requirement

  • Python 3.6+

  • PyTorch 1.4.0+

  • Tensorflow 1.15

  • Horovod

Acknowledgement

We thank MIT Satori cluster for providing the computation resource. We thank MIT-IBM Watson AI Lab, Qualcomm, NSF CAREER Award #1943349 and NSF RAPID Award #2027266 for supporting this research.

Part of the code is taken from once-for-all project for development.

Related Projects

TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning (NeurIPS'20)

Once for All: Train One Network and Specialize it for Efficient Deployment (ICLR'20)

ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware (ICLR'19)

AutoML for Architecting Efficient and Specialized Neural Networks (IEEE Micro)

AMC: AutoML for Model Compression and Acceleration on Mobile Devices (ECCV'18)

HAQ: Hardware-Aware Automated Quantization (CVPR'19, oral)