Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling
Abstract
- Propose "Bi-Directional Block Self-Attention Network" (Bi-BloSAN) for sequence modeling
- Better than RNN : Bi-BloSAN can capture long-range dependencies
- Better than CNN : Bi-BloSAN has better performance
- Better than existing SAN : Bi-BloSAN is memory-efficient on long sequence inputs
- SoTA on 9 different NLP benchmark tasks
Details
- Introduction
Background on Attention
- Vanilla Attention (Bahdanau et al. 2015)
- key, query, value : compatibility function scores each key against the query -> softmax transforms alignment scores into probabilities -> context is the probability-weighted average of the values (see the sketch at the end of this list)
- compatibility function can be either multiplicative (dot-product) (Vaswani et al. 2017, Sukhbaatar et al. 2015, Rush et al. 2015) or additive (Bahdanau et al. 2015, Shang et al. 2015)
- Multi-dimensional Attention
- an alignment score is computed for each feature of the word embedding (a vector of scores per token instead of a single scalar)
- Token2Token Self-Attention (Hu et al. 2017, Vaswani et al. 2017, Shen et al. 2017)
- key, query, value are all tokens of the same sequence; vanilla attention is applied to the sequence itself
- Source2Token Self-Attention (Lin et al. 2017, Shen et al. 2017, Liu et al. 2016)
- represents the importance of each token to the entire sentence
- Masked Self-Attention
- self-attention with positional information encoded via a positional mask M added to the alignment scores (forward/backward masks give the two directions used in Bi-BloSAN)
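A minimal NumPy sketch of the attention variants above (not the paper's released code); the shapes, parameter names, and the toy forward mask are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# toy dimensions: n source tokens with d-dim hidden states
n, d = 5, 8
rng = np.random.default_rng(0)
keys = rng.normal(size=(n, d))     # source hidden states (keys)
values = keys                      # vanilla attention: values are the source states
query = rng.normal(size=(d,))      # e.g. a decoder state

# multiplicative (dot-product) compatibility
scores_mul = keys @ query                                  # (n,)

# additive compatibility: w^T tanh(W1 k + W2 q)
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
w = rng.normal(size=(d,))
scores_add = np.tanh(keys @ W1.T + query @ W2.T) @ w       # (n,)

# softmax turns alignment scores into probabilities; the context is the
# probability-weighted average of the values
context = softmax(scores_mul) @ values                     # (d,)

# token2token self-attention: queries, keys, values are all the same tokens
scores_t2t = keys @ keys.T                                  # (n, n)

# masked self-attention: add a positional mask M to the scores, e.g. a
# forward mask letting token i attend to positions j <= i (diagonal kept
# here only so no row is fully masked in this toy example)
M_fw = np.where(np.tril(np.ones((n, n))) > 0, 0.0, -np.inf)
ctx_masked = softmax(scores_t2t + M_fw, axis=-1) @ values   # (n, d)

# source2token self-attention: score every token against the whole sentence
# and compress the sequence into a single vector
w_s2t = rng.normal(size=(d,))
sent_vec = softmax(keys @ w_s2t) @ values                   # (d,)
```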
Bi-BloSAN Model Architecture
- Encoder
- FC layer -> forward-masked and backward-masked block self-attention -> concat -> source2token self-attention
- Bi-BloSA
- Intra-block SA using masked self-attention on size-r blocks
- Inter-block SA using source2token self-attention + LSTM-like gate
- Context Fusion to generate long-term context
- final layer combines raw input + local context from intra-block SA + long-term context from Context Fusion (see the sketch below)
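A heavily simplified NumPy sketch of the Bi-BloSA data flow (intra-block SA -> inter-block SA + gate -> context fusion); this is not the released TensorFlow implementation, and the paper's parameterized projections, directional masks, and gates are replaced by toy stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def intra_block_attn(x):
    """Toy intra-block self-attention: dot-product scores within one block."""
    scores = x @ x.T                      # (r, r)
    return softmax(scores, axis=-1) @ x   # (r, d) local context per token

def source2token(x, w):
    """Toy source2token attention: compress a block into one vector."""
    p = softmax(x @ w)                    # (r,) importance of each token
    return p @ x                          # (d,)

# toy input: sequence of n tokens split into blocks of size r
n, d, r = 12, 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))
w_s2t = rng.normal(size=(d,))

blocks = x.reshape(n // r, r, d)

# 1) intra-block SA -> local context for every token
local = np.concatenate([intra_block_attn(b) for b in blocks], axis=0)    # (n, d)

# 2) inter-block SA: compress each block with source2token, then let the
#    block summaries attend to each other to capture long-range dependency
block_vecs = np.stack([source2token(b, w_s2t) for b in blocks])          # (n//r, d)
block_ctx = softmax(block_vecs @ block_vecs.T, axis=-1) @ block_vecs     # (n//r, d)

# LSTM-like gate mixing each block summary with its inter-block context
# (a single sigmoid gate here; the paper's gate is a learned layer)
g = 1.0 / (1.0 + np.exp(-(block_vecs + block_ctx)))
long_term = g * block_vecs + (1 - g) * block_ctx                         # (n//r, d)

# 3) context fusion: broadcast block-level long-term context back to tokens
#    and combine [raw input, local context, long-term context]
long_term_tok = np.repeat(long_term, r, axis=0)                          # (n, d)
fused = np.concatenate([x, local, long_term_tok], axis=-1)               # (n, 3d)
```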
Experiments
Class | Models |
---|---|
RNN | Bi-LSTM, Bi-GRU, Bi-SRU |
CNN | Multi-CNN, Hrchy-CNN |
SAN | Multi-head, DiSAN |
Task | Datasets |
---|---|
Natural Language Inference | SNLI |
Reading Comprehension | SQuAD |
Semantic Relatedness | SICK |
Sentence Classification | CR, MPQA, SUBJ, TREC, SST-1, SST-2 |
- Results
- SoTA performances on multiple tasks
Ablation Study
- Importance of Modules
- source2token self-attention > mBloSA > Local/Global
- Train/Inference Time Cost & Memory Consumption in SNLI task
- Faster than RNN
- Better performance than CNN
- Less memory than SAN
- Inference Time & Memory Consumption with varying Sequence Length
- Faster than DiSAN, RNN
- more memory-efficient than DiSAN
Personal Thoughts
- source2token self-attention only takes individual tokens as input, so how can it represent each token's relation to the entire source sentence?
- model settings for the RNN, CNN, SAN baselines seem unfair; parameter count or FLOPs should be matched for a fair comparison. Multi-head (Transformer) beats Bi-BloSAN in both inference time and memory consumption, so its capacity may simply have been insufficient here
- code is released in Python + Tensorflow v1.3 👍
Link : https://openreview.net/pdf?id=H1cWzoxA-
Authors : Anonymous et al. 2018