Semi-Autoregressive Neural Machine Translation
kweonwooj commented
Abstract
- propose a novel model for fast sequence generation - the Semi-Autoregressive Transformer (SAT)
- produce multiple successive words in parallel at each decoding step (K = 2, 4, 6, etc.)
- achieves a good balance between translation quality and decoding speed on WMT14 EnDe and EnZh
- up to 5.58x speedup while retaining 88% of translation quality on EnDe (maximum speedup)
- when K=2, SAT is almost lossless (only 1% drop in BLEU)
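The reported speedups can be sanity-checked with back-of-envelope arithmetic. A rough sketch (the token count N = 30 is an assumed example, and real per-step costs are not constant, so this is only an upper bound):

```python
import math

def ideal_speedup(n_tokens, k):
    """Ratio of decoding steps: autoregressive decoding needs n_tokens steps,
    SAT needs ceil(n_tokens / k). Assumes every step costs the same,
    so this is an upper bound on the real speedup."""
    return n_tokens / math.ceil(n_tokens / k)

print(ideal_speedup(30, 6))  # 6.0 -- the measured 5.58x on EnDe sits just below this bound
```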
Details
Introduction
- sequence generation tasks suffer from their autoregressive nature (outputs must be decoded one by one, in order)
- although CNN and self-attention modules enable parallel processing on the source/encoder side, the target/decoder side remains autoregressive
- Recent Works
- Gu et al. (2017) proposed a fully non-autoregressive NMT model with a fertility function to predict target length; significant speed gains, but translation quality degrades too much
- Lee et al. (2018) proposed a non-autoregressive sequence model with iterative refinement, but quality still suffers
- Kaiser et al. (2018) proposed a semi-autoregressive model in which a Transformer first auto-encodes the sentence into a shorter sequence of discrete latent variables, from which the target sentence is generated in parallel
Semi-Autoregressive Transformer
- Group-level Chain-Rule
- Long-Distance Prediction
- Relaxed Causal Mask
- Complexity and Acceleration (a = time on the decoder network, b = time on beam search)
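The Relaxed Causal Mask and Long-Distance Prediction ideas above can be sketched with a toy NumPy example (K, N, and the shapes here are illustrative assumptions, not the paper's code):

```python
import numpy as np

K, N = 2, 6  # group size and target length (toy values)

# Relaxed causal mask: position i may attend to position j iff j's group
# is not later than i's group, i.e. a block lower-triangular mask.
relaxed = np.fromfunction(lambda i, j: (j // K) <= (i // K), (N, N), dtype=int)

# Standard causal mask for comparison: lower-triangular including the diagonal.
causal = np.fromfunction(lambda i, j: j <= i, (N, N), dtype=int)

# Long-distance prediction: the decoder input is the target shifted by K
# positions (instead of 1), so y_t is predicted from y_{t-K}.
target = ["y1", "y2", "y3", "y4", "y5", "y6"]
decoder_input = ["<s>"] * K + target[:-K]
```

Within a group, the relaxed mask lets the K positions attend to one another (which the standard causal mask forbids), while the shifted input keeps each prediction conditioned only on earlier groups.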
Train
Result
- WMT14 EnDe
- NIST02 EnZh
Case Study
- position-wise cross-entropy is higher at later positions within a group, indicating that long-distance prediction is consistently harder
- observe a frequent word-repetition issue
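The position-wise measurement can be reproduced in miniature; the per-token losses below are made-up numbers for illustration only:

```python
import numpy as np

K = 2
# -log p(y_t) for each of 6 target positions (fabricated toy values).
token_nll = np.array([0.5, 1.4, 0.4, 1.3, 0.6, 1.5])

# Average loss at each relative position inside a group of K:
# index 0 = first word of a group, index 1 = second (long-distance) word.
per_pos = token_nll.reshape(-1, K).mean(axis=0)
```

Averaging by relative in-group position is what exposes the pattern: the long-distance slot accumulates a visibly higher loss than the near slot.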
Future Work
- design a better loss function or model for long-distance prediction
- explore more stable training methods together with knowledge distillation (KD)
- let the network adaptively determine the group size K
Personal Thoughts
- nice implementation. The idea itself is not especially novel since NAT already exists, and a semi-autoregressive model is a natural next step
- surprised to see that KD helps a lot in training
- very practical paper
Link : https://arxiv.org/pdf/1808.08583v2.pdf
Code : https://github.com/chqiwang/sa-nmt
Authors : Wang et al 2018