Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling
Abstract
- Propose "Bi-Directional Block Self-Attention Network" (Bi-BloSAN) for sequence modeling
- Better than RNN : Bi-BloSAN can capture long-range dependencies
- Better than CNN : Bi-BloSAN has better performance
- Better than existing SAN : Bi-BloSAN is memory-efficient on long sequence inputs
- SoTA on 9 different NLP benchmark tasks
Details
- Introduction
Background on Attention
- Vanilla Attention (Bahdanau et al. 2015)
- key, query, value : compatibility function scores each key against the query -> softmax transforms alignment scores into probabilities -> context is the probability-weighted average of the values (see the sketch at the end of this list)
- compatibility function can be either multiplicative (dot-product) (Vaswani et al. 2017, Sukhbaatar et al. 2015, Rush et al. 2015) or additive (Bahdanau et al. 2015, Shang et al. 2015)
- Multi-dimensional Attention
- an alignment score is computed for each feature of the word embedding (a vector of scores per token instead of a single scalar)
- Token2Token Self-Attention (Hu et al. 2017, Vaswani et al. 2017, Shen et al. 2017)
- key, query, value are all tokens of the same sequence; vanilla attention is applied to the sequence itself
- Source2Token Self-Attention (Lin et al. 2017, Shen et al. 2017, Liu et al. 2016)
- represents the importance of each token to the entire sentence
- Masked Self-Attention
- self-attention with positional information encoded via a positional mask M added to the alignment scores (forward/backward masks give the two directions used in Bi-BloSAN)
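A minimal NumPy sketch of the attention variants above (not the paper's released code); the shapes, parameter names, and the toy forward mask are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# toy dimensions: n source tokens with d-dim hidden states
n, d = 5, 8
rng = np.random.default_rng(0)
keys = rng.normal(size=(n, d))     # source hidden states (keys)
values = keys                      # vanilla attention: values are the source states
query = rng.normal(size=(d,))      # e.g. a decoder state

# multiplicative (dot-product) compatibility
scores_mul = keys @ query                                  # (n,)

# additive compatibility: w^T tanh(W1 k + W2 q)
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
w = rng.normal(size=(d,))
scores_add = np.tanh(keys @ W1.T + query @ W2.T) @ w       # (n,)

# softmax turns alignment scores into probabilities; the context is the
# probability-weighted average of the values
context = softmax(scores_mul) @ values                     # (d,)

# token2token self-attention: queries, keys, values are all the same tokens
scores_t2t = keys @ keys.T                                  # (n, n)

# masked self-attention: add a positional mask M to the scores, e.g. a
# forward mask letting token i attend to positions j <= i (diagonal kept
# here only so no row is fully masked in this toy example)
M_fw = np.where(np.tril(np.ones((n, n))) > 0, 0.0, -np.inf)
ctx_masked = softmax(scores_t2t + M_fw, axis=-1) @ values   # (n, d)

# source2token self-attention: score every token against the whole sentence
# and compress the sequence into a single vector
w_s2t = rng.normal(size=(d,))
sent_vec = softmax(keys @ w_s2t) @ values                   # (d,)
```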
Bi-BloSAN Model Architecture
- Encoder
- FC layer -> forward-masked and backward-masked block self-attention -> concat -> source2token self-attention
- Bi-BloSA
- Intra-block SA using masked self-attention on size-r blocks
- Inter-block SA using source2token self-attention + LSTM-like gate
- Context Fusion to generate long-term context
- final layer combines raw input + local context from intra-block SA + long-term context from Context Fusion (see the sketch below)
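A heavily simplified NumPy sketch of the Bi-BloSA data flow (intra-block SA -> inter-block SA + gate -> context fusion); this is not the released TensorFlow implementation, and the paper's parameterized projections, directional masks, and gates are replaced by toy stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def intra_block_attn(x):
    """Toy intra-block self-attention: dot-product scores within one block."""
    scores = x @ x.T                      # (r, r)
    return softmax(scores, axis=-1) @ x   # (r, d) local context per token

def source2token(x, w):
    """Toy source2token attention: compress a block into one vector."""
    p = softmax(x @ w)                    # (r,) importance of each token
    return p @ x                          # (d,)

# toy input: sequence of n tokens split into blocks of size r
n, d, r = 12, 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))
w_s2t = rng.normal(size=(d,))

blocks = x.reshape(n // r, r, d)

# 1) intra-block SA -> local context for every token
local = np.concatenate([intra_block_attn(b) for b in blocks], axis=0)    # (n, d)

# 2) inter-block SA: compress each block with source2token, then let the
#    block summaries attend to each other to capture long-range dependency
block_vecs = np.stack([source2token(b, w_s2t) for b in blocks])          # (n//r, d)
block_ctx = softmax(block_vecs @ block_vecs.T, axis=-1) @ block_vecs     # (n//r, d)

# LSTM-like gate mixing each block summary with its inter-block context
# (a single sigmoid gate here; the paper's gate is a learned layer)
g = 1.0 / (1.0 + np.exp(-(block_vecs + block_ctx)))
long_term = g * block_vecs + (1 - g) * block_ctx                         # (n//r, d)

# 3) context fusion: broadcast block-level long-term context back to tokens
#    and combine [raw input, local context, long-term context]
long_term_tok = np.repeat(long_term, r, axis=0)                          # (n, d)
fused = np.concatenate([x, local, long_term_tok], axis=-1)               # (n, 3d)
```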
Experiments
Class | Models |
---|---|
RNN | Bi-LSTM, Bi-GRU, Bi-SRU |
CNN | Multi-CNN, Hrchy-CNN |
SAN | Multi-head, DiSAN |
Task | Datasets |
---|---|
Natural Language Inference | SNLI |
Reading Comprehension | SQuAD |
Semantic Relatedness | SICK |
Sentence Classification | CR, MPQA, SUBJ, TREC, SST-1, SST-2 |
- Results
- SoTA performances on multiple tasks
Ablation Study
- Importance of Modules
- source2token self-attention > mBloSA > Local/Global
- Train/Inference Time Cost & Memory Consumption in SNLI task
- Faster than RNN
- Better performance than CNN
- Less memory than SAN
- Inference Time & Memory Consumption with varying Sequence Length
- Faster than DiSAN, RNN
- more memory-efficient than DiSAN
Personal Thoughts
- source2token self-attention only takes individual tokens as input, so how can it represent each token's relation to the entire source sentence?
- model settings for the RNN, CNN, SAN baselines seem unfair; parameter count or FLOPs should be matched for a fair comparison. Multi-head (Transformer) beats Bi-BloSAN in both inference time and memory consumption, so its capacity may simply have been insufficient here
- code is released in Python + Tensorflow v1.3 👍
Link : https://openreview.net/pdf?id=H1cWzoxA-
Authors : Anonymous et al. 2018