
Semi-Autoregressive Neural Machine Translation


Abstract

  • propose a novel model for fast sequence generation: the Semi-Autoregressive Transformer (SAT)
  • produce multiple successive words in parallel at each time step (K = 2, 4, 6, etc.)
  • achieve a good balance between translation quality and decoding speed on WMT14 EnDe and EnZh
    • 5.58x speedup while maintaining 88% of the translation quality on EnDe (maximum speedup)
    • with K=2, SAT is almost lossless (only ~1% drop in BLEU)

Details

Introduction

  • sequence generation tasks suffer from their autoregressive nature (outputs must be decoded one token at a time, in sequence)
    • although CNN and self-attention modules enable parallel processing on the source/encoder side, the target/decoder side remains autoregressive
  • Recent Works
    • Gu et al. 2017 proposed a fully non-autoregressive NMT model with a fertility function to predict the target length; significant gain in speed, but translation quality degrades too much
    • Lee et al. 2018 proposed a non-autoregressive sequence model with iterative refinement, but quality still suffers
    • Kaiser et al. 2018 proposed a semi-autoregressive model where a Transformer first auto-encodes the sentence into a shorter sequence of discrete latent variables, from which the target sentence is generated in parallel

Semi-Autoregressive Transformer

(figure: SAT model architecture)

  • Group-level Chain-Rule
    • the chain rule is applied to groups of K consecutive target tokens instead of single tokens; tokens within a group are generated in parallel (reconstruction below)
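      in symbols (my reconstruction of the equation the figure showed, with G_t the t-th group of K successive target tokens):

        p(y_1, ..., y_m | x) = \prod_{t=1}^{\lceil m/K \rceil} p(G_t | G_1, ..., G_{t-1}, x),  where  G_t = (y_{(t-1)K+1}, ..., y_{tK})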
  • Long-Distance Prediction
    • the model predicts K steps ahead: position t predicts y_{t+K} rather than y_{t+1} (see below)
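      equivalently (my reconstruction): the usual next-token objective p(y_{t+1} | y_1, ..., y_t, x) becomes p(y_{t+K} | y_1, ..., y_t, x), so each prediction must bridge K positions without seeing the intervening tokens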
  • Relaxed Causal Mask
    • the causal masking strategy is relaxed at training time: each position may also attend to the other positions inside its own group of K (sketch below)
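      a NumPy sketch of the relaxed causal mask as I read it (function name and 0/1 convention are mine, not from the released code):

        import numpy as np

        def relaxed_causal_mask(m, K):
            # entry [i, j] is 1 iff position i may attend to position j:
            # everything in i's own group of K and in all earlier groups
            idx = np.arange(m)
            group_end = (idx // K + 1) * K  # exclusive end of each position's group
            return (idx[None, :] < group_end[:, None]).astype(np.int32)

        # for m=4, K=2 this gives the lower-triangular mask with whole K-blocks unmasked:
        # [[1 1 0 0]
        #  [1 1 0 0]
        #  [1 1 1 1]
        #  [1 1 1 1]]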
  • Complexity and Acceleration (a = per-iteration time of the decoder network, b = per-iteration time of beam search)
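    back-of-the-envelope reading (mine, not the paper's exact derivation): an autoregressive Transformer needs m decoding iterations of roughly a + b each, while SAT needs only ceil(m/K) iterations since each decoder pass emits K tokens at once, so the ideal speedup approaches K when the decoder cost a dominates the beam-search cost b. A minimal greedy-decoding sketch to make the iteration count concrete; predict_next_group is a hypothetical stand-in for one decoder forward pass, not the API of the released sa-nmt code:

        EOS = 2  # assumed end-of-sentence token id

        def sat_greedy_decode(predict_next_group, src, K, max_len=200):
            # each iteration emits K successive tokens at once, so a length-m
            # output needs only ceil(m/K) decoder passes instead of m
            ys = []
            while len(ys) < max_len:
                ys.extend(predict_next_group(src, ys, K))
                if EOS in ys:
                    return ys[: ys.index(EOS)]  # truncate at end of sentence
            return ys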

Train

  • train with knowledge distillation (teacher-student model, the teacher being an autoregressive Transformer) for better performance (sketch below)
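    a hypothetical sketch of the sequence-level distillation recipe common in this line of work (teacher_translate and all names are mine): the human references are replaced by the teacher's own best translations, giving the student a simpler target distribution to fit

        def build_distilled_corpus(teacher_translate, source_corpus, beam_size=4):
            # sequence-level knowledge distillation: pair each source sentence
            # with the teacher Transformer's best hypothesis instead of the reference
            return [(src, teacher_translate(src, beam_size)) for src in source_corpus]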

Result

  • WMT14 EnDe
    • with K=2, BLEU is 26.90 (vs. 27.11 for the autoregressive SoTA), with a 1.51x speedup
    • good balance of speed and quality compared to other non-autoregressive methods
  • NIST02 EnZh
    • with K=2, BLEU is 39.57 (vs. 40.59 for the autoregressive SoTA), with a 1.69x speedup

Case Study

  • position-wise cross-entropy is higher at later positions, indicating that longer-distance prediction is consistently more difficult
  • a frequent word-repetition issue is observed in the outputs

Future Work

  • design better loss functions or model structures for long-distance prediction
  • explore more stable training methods to combine with knowledge distillation
  • let the network adaptively determine the group size K

Personal Thoughts

  • Nice implementation. The idea itself is not super creative, since NAT was already out and a semi-autoregressive model is the natural middle ground
  • surprised to see that KD helps this much in training
  • very practical paper

Link : https://arxiv.org/pdf/1808.08583v2.pdf
Code : https://github.com/chqiwang/sa-nmt
Authors : Wang et al. 2018