Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling


Abstract

  • Propose the "Bi-Directional Block Self-Attention Network" (Bi-BloSAN) for sequence modeling
    • Better than RNN : Bi-BloSAN can capture long-range dependencies
    • Better than CNN : Bi-BloSAN achieves better performance
    • Better than existing SAN : Bi-BloSAN is memory-efficient on long input sequences (rough derivation below)
  • SoTA on 9 different NLP benchmark tasks
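
Why blocking saves memory, as I read it (a rough back-of-the-envelope in my own notation, not the paper's exact analysis): splitting a length-n sequence into m = n/r blocks replaces the single n x n attention map with m small r x r maps plus one m x m map over block summaries, and choosing r on the order of n^(1/3) minimizes the total.

```latex
% n = sequence length, r = block length, m = n/r = number of blocks
% (my notation; the paper's analysis may use different symbols/constants)
\begin{aligned}
M_{\text{full SAN}}  &= O(n^2) \\[4pt]
M_{\text{Bi-BloSA}}  &= O\big(\underbrace{m \cdot r^2}_{\text{intra-block}}
                        + \underbrace{m^2}_{\text{inter-block}}\big)
                      = O\Big(nr + \frac{n^2}{r^2}\Big) \\[4pt]
\frac{d}{dr}\Big(nr + \frac{n^2}{r^2}\Big) &= n - \frac{2n^2}{r^3} = 0
  \;\Longrightarrow\; r \propto n^{1/3}
  \;\Longrightarrow\; M_{\text{Bi-BloSA}} = O\big(n^{4/3}\big) \ll O(n^2)
\end{aligned}
```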

Details

  • Introduction
    • RNN : the classic tool for sequence modeling, with LSTM, GRU, and SRU as popular modules
    • CNN : highly parallelizable models with fast inference, e.g. ByteNet, ConvS2S
    • SAN : self-attention-based approaches that have recently taken over the SoTA in sequence modeling, e.g. Transformer, DiSAN

Background on Attention

[screenshot: attention background figure from the paper]
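
A minimal NumPy sketch of the two attention primitives the notes rely on, under my reading of DiSAN: masked token2token self-attention (pairwise alignment scores plus a directional mask) and source2token self-attention (the whole sequence compressed into one vector). The paper's actual formulation uses multi-dimensional (feature-wise) additive attention; this sketch simplifies to scalar scores, and all function and parameter names (`masked_self_attention`, `source2token_attention`, `W1`, `W2`, `v`) are mine.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def directional_mask(n, forward=True):
    # 0 where attention is allowed, -1e30 where it is blocked;
    # diagonal kept for simplicity (DiSAN-style masks actually disable it)
    allow = np.triu(np.ones((n, n))) if forward else np.tril(np.ones((n, n)))
    return np.where(allow > 0, 0.0, -1e30)

def masked_self_attention(x, W1, W2, v, mask):
    """Token2token additive self-attention.
    x: (n, d) token features; mask: (n, n) additive mask.
    Returns (n, d): each token's context over the tokens it may attend to."""
    scores = np.tanh((x @ W1)[:, None, :] + (x @ W2)[None, :, :]) @ v  # (n, n)
    return softmax(scores + mask, axis=-1) @ x

def source2token_attention(x, W, v):
    """Source2token self-attention: score every token against a learned
    'summary' query and compress the whole sequence into one (d,) vector."""
    probs = softmax(np.tanh(x @ W) @ v)        # (n,)
    return probs @ x                           # (d,)

# toy usage
rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.normal(size=(n, d))
W1, W2, W = (rng.normal(size=(d, d)) for _ in range(3))
v = rng.normal(size=d)
ctx = masked_self_attention(x, W1, W2, v, directional_mask(n, forward=True))
vec = source2token_attention(x, W, v)
print(ctx.shape, vec.shape)   # (6, 4) (4,)
```

Swapping `forward=True` for `forward=False` gives the backward-masked variant; Bi-BloSAN runs both directions and concatenates the results.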

Bi-BloSAN Model Architecture

  • Encoder
    • FC layer -> forward/backward Bi-BloSA -> concat -> source2token self-attention (see the code sketch after the Bi-BloSA figure below)

[screenshot: Bi-BloSAN encoder architecture]

  • Bi-BloSA
    • Intra-block SA : masked self-attention within each block of size r, capturing local context
    • Inter-block SA : source2token self-attention over the block summaries plus an LSTM-like gate
    • Context Fusion generates the long-term context
    • The final layer combines the raw input, the local context from intra-block SA, and the long-term context from Context Fusion

[screenshot: Bi-BloSA (masked block self-attention) structure]
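
A self-contained NumPy sketch of how I read the Bi-BloSA flow above: split the sequence into length-r blocks, apply masked self-attention inside each block (local context), compress each block with source2token attention, apply masked self-attention across the block summaries, combine summaries and inter-block context with an LSTM-like sigmoid gate (long-term context), and fuse raw input + local context + long-term context; the encoder then concatenates a forward and a backward pass and pools with source2token attention. Shapes, the plain additive fusion, parameter sharing, and every name here are my simplifying assumptions, not the paper's exact equations (which use multi-dimensional attention and learned fusion gates).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def s2t(x, W, v):
    """source2token attention: compress (n, d) into one (d,) summary vector."""
    return softmax(np.tanh(x @ W) @ v) @ x

def masked_sa(x, W1, W2, v, mask):
    """token2token additive self-attention with an additive mask (0 / -1e30)."""
    s = np.tanh((x @ W1)[:, None, :] + (x @ W2)[None, :, :]) @ v + mask
    return softmax(s, axis=-1) @ x

def dir_mask(n, forward=True):
    """directional mask; diagonal kept for simplicity (my assumption)."""
    allow = np.triu(np.ones((n, n))) if forward else np.tril(np.ones((n, n)))
    return np.where(allow > 0, 0.0, -1e30)

def bi_blosa(x, r, params, forward=True):
    """one directional BloSA pass: (n, d) -> (n, d) context features."""
    n, d = x.shape
    pad = (-n) % r                               # pad so n is a multiple of r
    xp = np.vstack([x, np.zeros((pad, d))])
    blocks = xp.reshape(-1, r, d)                # (m, r, d)

    W1, W2, Ws, Wg, v = params
    m_local = dir_mask(r, forward)
    # intra-block SA: local context inside each block
    local = np.stack([masked_sa(b, W1, W2, v, m_local) for b in blocks])   # (m, r, d)
    # one summary vector per block via source2token attention
    summaries = np.stack([s2t(b, Ws, v) for b in local])                   # (m, d)

    # inter-block SA over block summaries + LSTM-like gate -> long-term context
    inter = masked_sa(summaries, W1, W2, v, dir_mask(len(summaries), forward))
    gate = sigmoid(summaries @ Wg)                                         # (m, d)
    long_term = gate * summaries + (1 - gate) * inter                      # (m, d)

    # context fusion: raw input + local context + broadcast long-term context
    # (plain addition stands in for the paper's learned fusion gate)
    fused = blocks + local + long_term[:, None, :]                         # (m, r, d)
    return fused.reshape(-1, d)[:n]                                        # drop padding

def bi_blosan_encoder(x, r, params, Wp, vp):
    """Bi-BloSAN sentence encoding: concat fw/bw passes, pool with source2token."""
    fw = bi_blosa(x, r, params, forward=True)
    bw = bi_blosa(x, r, params, forward=False)
    h = np.concatenate([fw, bw], axis=-1)        # (n, 2d)
    return s2t(h, Wp, vp)                        # (2d,) sentence vector

# toy usage
rng = np.random.default_rng(0)
n, d, r = 10, 8, 3
x = rng.normal(size=(n, d))
params = tuple(rng.normal(size=(d, d)) for _ in range(4)) + (rng.normal(size=d),)
Wp, vp = rng.normal(size=(2 * d, 2 * d)), rng.normal(size=2 * d)
print(bi_blosan_encoder(x, r, params, Wp, vp).shape)   # (16,)
```

Note that no n x n attention map is ever materialized: the largest maps are r x r (per block) and m x m (over summaries), which is where the memory savings from the derivation in the Abstract section come from.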

Experiments

| Class | Models |
| --- | --- |
| RNN | Bi-LSTM, Bi-GRU, Bi-SRU |
| CNN | Multi-CNN, Hrchy-CNN |
| SAN | Multi-head, DiSAN |

| Task | Datasets |
| --- | --- |
| Natural Language Inference | SNLI |
| Reading Comprehension | SQuAD |
| Semantic Relatedness | SICK |
| Sentence Classification | CR, MPQA, SUBJ, TREC, SST-1, SST-2 |

  • Results
    • SoTA performances on multiple tasks

[screenshots: result tables on SNLI, SQuAD, SICK, and the sentence classification benchmarks]

Ablation Study

  • Importance of Modules
    • source2token self-attention > mBloSA > Local/Global (in decreasing order of importance)

[screenshot: module ablation results]

  • Train/Inference Time Cost & Memory Consumption on the SNLI task
    • Faster than the RNNs
    • Better performance than the CNNs
    • Less memory than existing SANs

[screenshot: time and memory comparison on SNLI]

  • Inference Time & Memory Consumption with varying Sequence Length
    • Faster than DiSAN and the RNNs
    • More memory-efficient than DiSAN

[screenshot: inference time and memory vs. sequence length]

Personal Thoughts

  • source2token self-attention takes only each token as input, so how can it represent the token's relation to the entire source sentence?
  • The settings for the RNN, CNN, and SAN baselines seem unfair; parameter counts or FLOPs should be matched for a fair comparison. Multi-head attention (Transformer) beats Bi-BloSAN in both inference time and memory consumption, so its model capacity may simply have been insufficient.
  • Code is released in Python + TensorFlow v1.3 👍

Link : https://openreview.net/pdf?id=H1cWzoxA-
Authors : Anonymous et al. 2018