Course: TinyML and Efficient Deep Learning Computing
Instructor: Song Han (Associate Professor, MIT EECS)
[schedule(2023 Fall)] | [schedule(2022 Fall)] | [youtube]
-
Studying efficient inference methods
Study algorithms that improve the efficiency of deep learning computation.
-
Building deep learning models under constrained resources
Construct efficient deep learning models tailored to the constraints of the target device.
-
Efficiency metrics: latency, storage, energy
Memory-related(#parameters, model size, #activations), Computation(MACs, FLOPs)
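A quick sketch (PyTorch; the layer sizes are made up) of how #parameters and MACs are counted for a single conv layer:

```python
import torch.nn as nn

# MACs for a convolution = k_h * k_w * C_in * C_out * H_out * W_out (bias ignored).
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)

num_params = sum(p.numel() for p in conv.parameters())  # 3*3*64*128 + 128
h_out = w_out = 56                                      # assumed output resolution
macs = 3 * 3 * 64 * 128 * h_out * w_out

print(f"#parameters: {num_params:,}")                   # 73,856
print(f"MACs: {macs:,}")                                # ~231M
```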
-
Pruning Granularity, Pruning Criterion
unstructured/structured pruning
magnitude-based pruning(L1-norm), second-order-based pruning, percentage-of-zero-based pruning, regression-based pruning
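A minimal sketch of unstructured magnitude-based pruning: keep the largest-|w| weights, zero the rest (sizes are arbitrary):

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a binary mask that zeros the smallest-|w| fraction of weights."""
    num_prune = int(weight.numel() * sparsity)
    if num_prune == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(num_prune).values
    return (weight.abs() > threshold).float()

w = torch.randn(256, 512)
mask = magnitude_prune(w, sparsity=0.7)
w_pruned = w * mask
print(f"achieved sparsity: {(w_pruned == 0).float().mean():.2f}")
```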
-
Automatic Pruning, Lottery Ticket Hypothesis
Pruning Ratio, Sensitivity Analysis, Automatic Pruning(AMC, NetAdapt)
Lottery Ticket Hypothesis(Winning Ticket, Iterative Magnitude Pruning, Scaling Limitation), Pruning with Regularization
Pruning at Initialization(Connection Sensitivity)
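A hedged sketch of iterative magnitude pruning in the lottery-ticket style; `train_fn` is a hypothetical stand-in for a training loop that applies the masks:

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, rounds=5, rate=0.2):
    init_state = copy.deepcopy(model.state_dict())      # theta_0 for rewinding
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()
             if p.dim() > 1}                            # prune weight matrices only
    for _ in range(rounds):
        train_fn(model, masks)                          # hypothetical: train with masks
        for name, p in model.named_parameters():
            if name not in masks:
                continue
            alive = (p * masks[name]).abs()[masks[name] > 0]
            k = int(alive.numel() * rate)               # prune 20% of survivors/round
            if k > 0:
                thr = alive.kthvalue(k).values
                masks[name] = ((p.abs() > thr) & (masks[name] > 0)).float()
        model.load_state_dict(init_state)               # rewind survivors to init
    return masks
```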
-
System & Hardware Support for Sparsity
EIE(CSC format: relative index, column pointer)
M:N Sparsity
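A toy sketch of EIE-style CSC encoding with relative indices, assuming a 4-bit index so zero-runs longer than 15 need a padding zero; this follows my reading of the paper, not reference code:

```python
import numpy as np

def encode_csc_relative(W, max_jump=15):        # 4-bit relative index => max 15
    values, rel_idx, col_ptr = [], [], [0]
    for j in range(W.shape[1]):
        gap = 0                                 # zeros since the last nonzero
        for i in range(W.shape[0]):
            if W[i, j] == 0:
                gap += 1
                continue
            while gap > max_jump:               # index overflow:
                values.append(0.0)              # insert a padding zero entry
                rel_idx.append(max_jump)
                gap -= max_jump + 1
            values.append(W[i, j])
            rel_idx.append(gap)
            gap = 0
        col_ptr.append(len(values))             # where each column starts
    return values, rel_idx, col_ptr

W = np.array([[0, 3], [0, 0], [1, 0], [0, 2]], dtype=float)
print(encode_csc_relative(W))   # ([1.0, 3.0, 2.0], [2, 0, 2], [0, 1, 3])
```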
-
Basic Concepts of Quantization
Numeric Data Types: Integer, Fixed-Point, Floating-Point(IEEE FP32/FP16, BF16, NVIDIA FP8), INT4 and FP4
Uniform vs Non-uniform quantization, Symmetric vs Asymmetric quantization
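A small sketch contrasting symmetric and asymmetric linear quantization at 8 bits:

```python
import torch

def quantize_symmetric(x, bits=8):
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    scale = x.abs().max() / qmax
    q = torch.round(x / scale).clamp(-qmax - 1, qmax)
    return q, scale                                 # reconstruct: q * scale

def quantize_asymmetric(x, bits=8):
    qmin, qmax = 0, 2 ** bits - 1                   # [0, 255] for UINT8
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = torch.round(-x.min() / scale).clamp(qmin, qmax)
    q = (torch.round(x / scale) + zero_point).clamp(qmin, qmax)
    return q, scale, zero_point                     # reconstruct: (q - z) * scale

x = torch.randn(1000)
q, s = quantize_symmetric(x)
q2, s2, z2 = quantize_asymmetric(x)
print((x - q * s).abs().max(), (x - (q2 - z2) * s2).abs().max())
```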
-
Vector Quantization, Linear Quantization
Vector Quantization(VQ): Deep Compression(iterative pruning, retrain codebook, Huffman encoding), Product Quantization(PQ): AND THE BIT GOES DOWN
Linear Quantization: Zero point, Scaling Factor, Quantization Error(clip error, round error), Linear Quantized Matrix Multiplication(FC layer, Conv layer)
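A minimal sketch of linear-quantized matrix multiplication for an FC layer: integer accumulation, then a single floating-point rescale (int64 stands in for the int32 accumulator; weights symmetric, activations asymmetric):

```python
import torch

torch.manual_seed(0)
x, w = torch.randn(4, 64), torch.randn(32, 64)

s_w = w.abs().max() / 127                           # symmetric weights, z_w = 0
q_w = torch.round(w / s_w).clamp(-128, 127).long()

s_x = (x.max() - x.min()) / 255                     # asymmetric activations
z_x = int(torch.round(-x.min() / s_x).clamp(0, 255))
q_x = (torch.round(x / s_x) + z_x).clamp(0, 255).long()

# integer accumulation, then one rescale: Y ~= s_x * s_w * (Q_x - Z_x) @ Q_w^T
y_hat = s_x * s_w * ((q_x - z_x) @ q_w.t()).float()
print("max |error| vs FP32:", (y_hat - x @ w.t()).abs().max().item())
```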
-
Weight Quantization: Per-Tensor Quantization, Per-Channel Quantization, Group Quantization(Per-Vector, MX), Weight Equalization, Adaptive Rounding(AdaRound)
Activation Quantization: During training(EMA), Calibration(Min-Max, KL-divergence, Mean Squared Error)
Bias Correction, Zero-Shot Quantization(ZeroQ)
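A short sketch of per-tensor vs. per-channel weight scales; the synthetic channel ranges are chosen to make the per-channel advantage visible:

```python
import torch

# Uneven per-channel ranges: early rows ~100x smaller than late rows.
w = torch.randn(64, 128) * torch.logspace(-2, 0, 64).unsqueeze(1)

def quant_dequant(w, scale):
    return torch.round(w / scale).clamp(-128, 127) * scale

s_tensor = w.abs().max() / 127                        # one scale for the tensor
s_channel = w.abs().amax(dim=1, keepdim=True) / 127   # one scale per out-channel

err_tensor = (w - quant_dequant(w, s_tensor)).pow(2).mean()
err_channel = (w - quant_dequant(w, s_channel)).pow(2).mean()
print(f"per-tensor MSE {err_tensor:.2e}  per-channel MSE {err_channel:.2e}")
```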
-
Quantization-Aware Training, Low bit-width quantization
Fake quantization, Straight-Through Estimator
Binary Quantization(Deterministic, Stochastic, XNOR-Net), Ternary Quantization
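A sketch of fake quantization with the straight-through estimator: the forward pass does quantize-dequantize, the backward pass treats round() as the identity:

```python
import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        # simulate INT8: quantize, clamp, dequantize
        return torch.round(x / scale).clamp(-128, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None            # STE: d(round)/dx ~= 1, no grad for scale

x = torch.randn(10, requires_grad=True)
y = FakeQuant.apply(x, x.abs().max().detach() / 127)
y.sum().backward()
print(x.grad)                            # all ones: gradient passed straight through
```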
-
Neural Architecture Search: basic concepts & manually-designed neural networks
input stem, stage, head
AlexNet, VGGNet, SqueezeNet(global average pooling, fire module, pointwise convolution), ResNet50(bottleneck block, residual learning), ResNeXt(grouped convolution)
MobileNet(depthwise-separable convolution, width/resolution multiplier), MobileNetV2(inverted bottleneck block), ShuffleNet(channel shuffle), SENet(squeeze-and-excitation block), MobileNetV3(redesigning expensive layers, h-swish)
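A sketch of MobileNet's depthwise-separable convolution (depthwise 3x3 with groups = channels, then pointwise 1x1), with a parameter-count comparison against a dense 3x3:

```python
import torch.nn as nn

def dw_separable(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride, padding=1, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in), nn.ReLU6(inplace=True),   # depthwise 3x3
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU6(inplace=True),  # pointwise 1x1
    )

# MAC ratio vs. dense 3x3 is roughly 1/c_out + 1/9, i.e. ~8-9x cheaper
block = dw_separable(64, 128)
dense = nn.Conv2d(64, 128, 3, padding=1, bias=False)
print(sum(p.numel() for p in block.parameters()),
      sum(p.numel() for p in dense.parameters()))
```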
-
Neural Architecture Search: RNN controller & search strategy
cell-level search space, network-level search space
design the search space: Cumulative Error Distribution, FLOPs distribution
Search Strategy: grid search, random search, reinforcement learning, bayesian optimization, gradient-based search, evolutionary search
EfficientNet(compound scaling), DARTS
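A toy sketch of evolutionary search over a tiny search space; `accuracy_proxy` is a hypothetical stand-in for real performance estimation:

```python
import random

DEPTHS, WIDTHS, KERNELS = [2, 3, 4], [16, 32, 64], [3, 5, 7]
SPACE = {"depth": DEPTHS, "width": WIDTHS, "kernel": KERNELS}

def sample():
    return {k: random.choice(v) for k, v in SPACE.items()}

def mutate(arch):
    child = dict(arch)
    key = random.choice(list(child))
    child[key] = random.choice(SPACE[key])          # re-sample one dimension
    return child

def accuracy_proxy(arch):                           # hypothetical fitness function
    return arch["depth"] * 0.1 + arch["width"] * 0.01 - arch["kernel"] * 0.02

population = [sample() for _ in range(20)]
for _ in range(10):                                 # keep the fittest, mutate them
    population.sort(key=accuracy_proxy, reverse=True)
    parents = population[:5]
    population = parents + [mutate(random.choice(parents)) for _ in range(15)]
print(max(population, key=accuracy_proxy))
```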
-
Neural Architecture Search: Performance Estimation & Hardware-Aware NAS
Weight Inheritance, HyperNetwork, Weight Sharing(super-network, sub-network)
Performance Estimation Heuristics: Zen-NAS, GradSign
Hardware-Aware NAS(ProxylessNAS, HAT), One-Shot NAS(Once-for-All)
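A hedged sketch of weight sharing: sub-networks of different widths reuse slices of one super-network layer (Once-for-All style, greatly simplified):

```python
import torch
import torch.nn.functional as F

class SlimmableLinear(torch.nn.Module):
    """Super-network layer whose sub-networks slice the shared weight."""
    def __init__(self, max_in, max_out):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(max_out, max_in) * 0.01)

    def forward(self, x, out_features):
        w = self.weight[:out_features, : x.shape[-1]]   # shared-weight slice
        return F.linear(x, w)

layer = SlimmableLinear(max_in=64, max_out=128)
x = torch.randn(4, 64)
for width in (32, 64, 128):                             # sample sub-networks
    print(width, layer(x, width).shape)                 # no retraining needed
```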
-
Knowledge Distillation(distillation loss, temperature)
KD: matching intermediate weights/features/attention maps/sparsity pattern/relational information(layers, samples)
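A sketch of the standard distillation loss with temperature T: softened KL against the teacher, blended with the hard-label loss:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                          # T^2 keeps gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(8, 10), torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(kd_loss(s, t, y))
```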
-
Self Distillation, Online Distillation, Applications
Self Distillation, Online Distillation, Combining Online and Self-Distillation, Network Augmentation
Applications: Object Detection, Semantic Segmentation, GAN, NLP
-
microcontroller, flash/SRAM usage, peak SRAM usage, MCUNet: TinyNAS, TinyEngine
TinyNAS: automated search space optimization(weight/resolution multiplier), resource-constrained model specialization(Once-for-All)
MCUNetV2: patch-based inference, network redistribution, joint automated search for optimization, MCUNetV2 architecture(VWW dataset inference)
RNNPool, MicroNets(MOPs & latency/energy consumption relationship)
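A back-of-the-envelope sketch of the peak-SRAM analysis: weights sit in flash, activations in SRAM, and the layer whose input + output activations are largest sets the peak (the per-layer sizes below are made up):

```python
# hypothetical per-layer activation sizes in KB: (input, output)
layers = [(96, 48), (48, 96), (96, 24), (24, 24)]

# both the input and output buffers of a layer must fit in SRAM at once
peak_kb = max(inp + out for inp, out in layers)
print(f"peak SRAM usage: {peak_kb} KB")   # the single widest layer dominates;
                                          # patch-based inference shrinks this peak
```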
-
memory hierarchy of MCU, data layout(NCHW, NHWC, CHWN)
TinyEngine: Loop Unrolling, Loop Reordering, Loop Tiling, SIMD programming, Im2col, In-place depthwise convolution, appropriate data layout(pointwise, depthwise convolution), Winograd convolution
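A sketch of Im2col in NumPy: unroll convolution windows into columns so the convolution becomes a single GEMM (stride 1, no padding, for simplicity):

```python
import numpy as np

def im2col(x, k):                          # x: (C, H, W)
    C, H, W = x.shape
    H_out, W_out = H - k + 1, W - k + 1
    cols = np.empty((C * k * k, H_out * W_out), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            cols[:, i * W_out + j] = x[:, i:i + k, j:j + k].ravel()
    return cols

x = np.random.randn(3, 8, 8)
w = np.random.randn(4, 3, 3, 3)            # (C_out, C_in, k, k)
y = w.reshape(4, -1) @ im2col(x, 3)        # convolution as one matrix multiply
print(y.reshape(4, 6, 6).shape)
```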
-
Lecture 12: Paper Reading Presentation
Lecture 24: Final Project Presentation
Lecture 25: Final Project Presentation