Transformer Architecture Search (TAS): A Survey

Introduction

This is a summary of Neural Architecture Search (NAS) applied to the Transformer, referred to as Transformer Architecture Search (TAS).

Papers are divided into the following categories according to their application domain:

  • Natural Language Processing (NLP)
  • Computer Vision (CV)
  • Automatic Speech Recognition (ASR)
  • Other

Each TAS work is introduced from three aspects: Search Space, Search Strategy, and Estimation Strategy, which are the key components of NAS and are equally important for TAS.
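
To make this decomposition concrete, the sketch below shows how the three components interact in a generic NAS loop: the search space defines the candidates, the search strategy proposes them, and the estimation strategy scores them. It is illustrative only and not taken from any surveyed method; all names, choices, and the toy scoring function are hypothetical.

```python
import random

# Hypothetical module-level search space for a Transformer encoder:
# an architecture is one discrete choice per dimension.
SEARCH_SPACE = {
    "num_layers": [4, 6, 8],
    "num_heads": [4, 8, 16],
    "ffn_dim": [1024, 2048, 3072],
}

def propose(space):
    """Search strategy: plain random search here; EA, RL, or gradient
    optimization would replace this step in most surveyed methods."""
    return {name: random.choice(choices) for name, choices in space.items()}

def estimate(arch):
    """Estimation strategy: normally train/evaluate the candidate, or use a
    cheaper proxy (early stopping, weight sharing in a supermodel, a learned
    predictor). A toy score stands in for that cost here."""
    return -abs(arch["num_heads"] - 8) - abs(arch["ffn_dim"] - 2048) / 1024.0

def search(num_trials=20):
    """Generic NAS loop: propose, estimate, keep the best candidate."""
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = propose(SEARCH_SPACE)
        score = estimate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

if __name__ == "__main__":
    print(search())
```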

More details of each work are summarized in our survey paper, which is coming soon.

Summary of TAS

Representative work on Transformer Architecture Search (TAS), where H-level, M-level, and A-level refer to hyperparameter-level, module-level, and architecture-level, respectively. EA, RL, GO, and RS denote evolutionary algorithm, reinforcement learning, gradient optimization, and random search, respectively.

| Category | Task | Method | Search Space (H-level / M-level / A-level) | Search Strategy & Performance Estimation | Publication |
| --- | --- | --- | --- | --- | --- |
| Natural Language Processing | Improving Performance | Auto-Sizing Transformer | H-level: automatically reducing the model size during training | Gradient optimization & train a supermodel | arXiv 2019-10 |
| | | Sandwich Transformer | A-level: the permutation (order) of FFN and self-attention sublayers | Random search & sample solutions, then train and evaluate them | arXiv 2020-4 |
| | | Multi-Pass Transformer | A-level: information flow in encoders (1. multi-pass encoders; 2. information flow between these encoders) | Random search & sample solutions, then train and evaluate them | arXiv 2020-9 |
| | | Evolved Transformer | M-level: cell search space | EA & sample solutions, then train them with early stopping for evaluation | ICML 2019 |
| | | AutoTrans | M-level: 1. cell search space; 2. activation; 3. norm; 4. number of heads; 5. relative dimension | RL & sample solutions in a supermodel (one-shot method) | arXiv 2020-9 |
| | | CAS | M-level: modifications: 1. AddLinear; 2. AddLSTM; 3. FixSubset | Coordinate architecture search & sample solutions, then fine-tune and evaluate them | arXiv 2019-10 |
| | | NAS-BERT | M-level: 1. embedding size; 2. number of heads; 3. hidden size; 4. SepConv (kernel 3/5/7); 5. identity | Direct sampling (selection) & sample solutions in a supermodel (one-shot method) with performance approximation | KDD 2021 |
| | | AdaBERT | | | arXiv 2021-1 |
| | | AutoBERT-Zero | | | arXiv 2021-7 |
| | | Length-Adaptive Transformer | | | arXiv 2021-6 |
| | | AutoTinyBERT | | | ACL 2021 |
| | | Primer | | | NeurIPS 2021 |
| | Hardware-aware Deployment | Fast Transformers | M-level: 1. dimensions of Q, K, V; 2. number of heads; 3. LN mean value; 4. width and depth of FFN | Sampling distribution optimization & sample solutions in a supermodel (one-shot method) | arXiv 2020-8 |
| | | HAT | M-level: 1. number of layers in encoder/decoder; 2. embedding dimension, FFN hidden dimension, and number of heads; 3. arbitrary encoder-decoder attention (links) | EA & sample solutions in a supermodel (one-shot method) with a surrogate hardware predictor | ACL 2020 |
| | | RankNAS | M-level: same as HAT, plus 4. norm type (Pre-LN, Post-LN); 5. RPR length in [8, 12, 16] (the maximum relative position representation) | Random search / EA & sample solutions in a supermodel (one-shot method), then rank and select them with a ranking model | arXiv 2021-9 |
| | | DARTsFormer | M-level: standard conv w×1 (w in 3, 5, 7, 11); dynamic conv w×1 (w in 3, 7, 11, 15); self-attention; FFN; cross-attention (decoder only); gated linear unit (GLU); zero (returns a zero tensor of the input size); identity (returns the input) | Gradient optimization (multi-split reversible network to reduce memory) & train a supermodel | arXiv 2021-5 |
| | Quantization | Mixed Precision Quantization Transformer | H-level: 1-bit, 2-bit, 4-bit, and 8-bit | Gradient optimization & train a supermodel | ICASSP 2021 |
| | Other | TextNAS | Not a Transformer search space: 1. convolutional layers; 2. recurrent layers; 3. pooling layers; 4. multi-head self-attention layers | RL (same as ENAS) & sample solutions in a supermodel (one-shot method) | AAAI 2020 |
| Computer Vision | | AutoFormer | | | ICCV 2021 |
| | | GLiT | | | ICCV 2021 |
| | | Multi-stage ViT-ResNAS | | | ICCV 2021 |
| | | PSViT | | | arXiv 2021-8 |
| | | ViTAS | | | arXiv 2021-6 |
| | | HR-NAS | | | CVPR 2021 |
| | Vanilla NAS for CNN | NAT (does not search for Transformers) | | | PAMI 2021 |
| | | BossNAS | | | |
| Speech | Speech Recognition | Evolved Speech-Transformer | M-level: cell space (same as Evolved Transformer) | EA & sample solutions with progressive dynamic hurdles (early stopping) | INTERSPEECH 2020 |
| | | Streaming Transformer | | | SLT 2020 |
| | | Improved Conformer via NAS | | | arXiv 2021-4 |
| | | DARTS-Conformer | | | arXiv 2021-8 |
| | | LightSpeech | | | ICASSP 2021 |
| Other | | Analyzing and Mitigating Interference | | | arXiv 2021-8 |
| | | UNINET | | | arXiv 2021-10 |
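
Many of the entries above estimate performance by sampling solutions in a supermodel (one-shot method): candidate sub-networks reuse the weights of a single over-parameterized supermodel instead of being trained from scratch. The fragment below is a minimal, hypothetical PyTorch sketch of that weight-sharing idea for an elastic FFN width; it is not code from any surveyed paper, and the dimensions and choices are made up for illustration.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticLinear(nn.Module):
    """Weight-sharing linear layer: each sampled sub-network uses a slice of
    the largest weight matrix rather than owning separate parameters."""
    def __init__(self, max_in, max_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, max_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x, out_features):
        # Slice the shared parameters down to the sampled width.
        w = self.weight[:out_features, :x.size(-1)]
        b = self.bias[:out_features]
        return F.linear(x, w, b)

# One-shot sampling: pick an FFN width from a (hypothetical) search space and
# run the shared weights at that width. During supermodel training a new width
# is sampled each step; during search, sampled widths are evaluated directly
# without retraining.
ffn_choices = [1024, 2048, 3072]
layer = ElasticLinear(max_in=512, max_out=max(ffn_choices))
x = torch.randn(2, 10, 512)
width = random.choice(ffn_choices)
y = layer(x, out_features=width)  # shape: (2, 10, width)
```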
