Achieving Human Parity on Automatic Chinese to English News Translation
Abstract
- reports that the quality of Microsoft's Chinese-to-English machine translation on news sentences is at human parity
- uses WMT2017 news translation data
- defines how to accurately measure human parity in translation
- describes the workflow and various experiments
Details
Defining Human Parity
- official definition :
If there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations then the machine has achieved human parity
- Evaluation method (a parity-check sketch follows this section)
- use direct assessment described in WMT17
- use source-based evaluation methodology described in IWSLT17
- annotators are shown the source text and a candidate translation and asked "How accurately does the above candidate text convey the semantics of the source text?", answering with a slider ranging from 0 to 100 (100 being perfect)
- to identify unreliable crowd workers, direct assessment randomly mixes in artificially degraded translation output
- use Neural Machine Translation
- LSTM, ConvS2S, and Transformer are all SoTA models; Transformer is chosen as the baseline
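As a rough illustration of the parity criterion, here is a minimal sketch in Python: machine output counts as parity if its direct-assessment scores are not significantly different from the human translations' scores. The choice of scipy's Mann-Whitney U test and the alpha level are my assumptions; the paper's actual procedure (significance clustering across systems) is more involved.

```python
from scipy.stats import mannwhitneyu

def at_human_parity(machine_scores, human_scores, alpha=0.05):
    """machine_scores/human_scores: lists of 0-100 direct-assessment
    annotations for the same test set. Parity = no statistically
    significant difference between the two score samples."""
    _, p_value = mannwhitneyu(machine_scores, human_scores,
                              alternative="two-sided")
    return p_value >= alpha  # no significant difference => parity
```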
Main Contributions
- Main Techniques used to achieve human parity
- Careful data selection and filtering
- Dual Learning to utilize the duality of the translation problem
- Iterative joint training algorithm described in Zhang et al. 2018 to enhance the effect of monolingual data via back-translation
- Deliberation Network to refine translation based on two-pass decoding
- New training objective based on KL divergence to encourage agreement between left-to-right and right-to-left translation
- System Combination and Re-ranking
Data Selection and Filtering
- Learn a bilingual sentence vector representation mapped into the same space to filter the noisy data and select relevant data
- use the method of Zoph et al. 2016 on a subset of data known to be of good quality and relevant domain
- use RNN enc-dec similar to GNMT as base model for representation learning
- use cosine similarity of the sentence representations of source S and target T
- remove sentence pairs with similarity below a specified threshold
- Rule-based filtering
- both source and target sentences must contain at least 3 and at most 70 words
- pairs where src_len > 1.3 * tgt_len or tgt_len > 1.3 * src_len (length ratio above 1.3) are removed
- sentences with illegal characters (URLs, characters of other languages) are removed
- Chinese sentences without any Chinese characters are removed
- duplicated sentence pairs are removed (a combined filtering sketch follows this list)
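A combined sketch of the filters above, assuming sentence vectors from the bilingual encoder are already computed. The helper itself is hypothetical, not the paper's code, and the 0.2 threshold is borrowed from the larger-corpus experiment later in these notes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)

def keep_pair(src_tokens, tgt_tokens, src_vec, tgt_vec, threshold=0.2):
    """Rule-based filters plus sentence-vector similarity.
    src_tokens is the Chinese side; src_vec/tgt_vec are assumed to come
    from the bilingual encoder described above (not implemented here)."""
    s_len, t_len = len(src_tokens), len(tgt_tokens)
    if not (3 <= s_len <= 70 and 3 <= t_len <= 70):
        return False                                    # length bounds
    if s_len > 1.3 * t_len or t_len > 1.3 * s_len:
        return False                                    # length-ratio filter
    if any("http" in tok for tok in src_tokens + tgt_tokens):
        return False                                    # crude URL filter
    if not any("\u4e00" <= ch <= "\u9fff" for ch in "".join(src_tokens)):
        return False                                    # Chinese side must contain Chinese chars
    return cosine(src_vec, tgt_vec) >= threshold        # similarity filter
```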
Dual Learning
- Dual Unsupervised Learning uses the reconstruction log-likelihood of a monolingual corpus during training
- Dual Supervised Learning trains the primal and dual models simultaneously under a regularizer that encourages duality in their probability distributions (a sketch of the regularizer follows)
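A sketch of the DSL regularizer's form (Xia et al. 2017), with the marginal log-probabilities assumed to come from pre-trained monolingual language models:

```python
def dsl_regularizer(log_p_x, log_p_y, log_p_y_given_x, log_p_x_given_y):
    """Dual supervised learning regularizer: both factorizations of
    log P(x, y) should agree, so penalize the squared gap between
    log P(x) + log P(y|x) and log P(y) + log P(x|y).
    log_p_x / log_p_y are assumed to come from monolingual LMs."""
    gap = (log_p_x + log_p_y_given_x) - (log_p_y + log_p_x_given_y)
    return gap ** 2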
Iterative Joint Training
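A hypothetical sketch of the alternating loop from Zhang et al. 2018: each direction back-translates monolingual data for the other, and both models are retrained on real plus synthetic pairs. `train` and `translate` are stand-in stubs, not the paper's code.

```python
def train(model, pairs):
    """Stand-in for one round of NMT training on (src, tgt) pairs."""
    return model

def translate(model, sentences):
    """Stand-in for batch decoding with the given model."""
    return ["<hyp for: %s>" % s for s in sentences]

def iterative_joint_training(zh2en, en2zh, bitext, mono_zh, mono_en, rounds=3):
    """bitext: list of (zh, en) pairs; mono_*: monolingual sentences.
    Each round, every direction back-translates fresh synthetic data
    for the other, then both models retrain on the enlarged corpora."""
    for _ in range(rounds):
        # synthetic (zh, en) pairs for Zh->En: back-translate English monolingual data
        synth_zh_en = list(zip(translate(en2zh, mono_en), mono_en))
        # synthetic (en, zh) pairs for En->Zh: back-translate Chinese monolingual data
        synth_en_zh = list(zip(translate(zh2en, mono_zh), mono_zh))
        zh2en = train(zh2en, bitext + synth_zh_en)
        en2zh = train(en2zh, [(en, zh) for zh, en in bitext] + synth_en_zh)
    return zh2en, en2zh
```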
Deliberation Network
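A minimal sketch of the two-pass decoding idea behind the deliberation network: the second-pass decoder sees both the source and the first-pass draft. All components here are stand-in stubs, not the paper's architecture code.

```python
def deliberation_decode(src, encode, first_pass, second_pass):
    """encode/first_pass/second_pass are stand-ins for the shared
    encoder and the two decoders of the deliberation network."""
    src_states = encode(src)                      # encode the source once
    draft, draft_states = first_pass(src_states)  # pass 1: produce a draft translation
    return second_pass(src_states, draft_states)  # pass 2: refine with a global view of the draft
```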
L2R, R2L Agreement Regularization
- signals from R2L model can be leveraged to alleviate the exposure bias problem of L2R model and vice versa
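One way to picture the agreement term is a Monte-Carlo estimate of KL(p_L2R || p_R2L) over sampled translations; the paper's exact objective differs in its details, so treat this as an illustrative stand-in:

```python
def kl_agreement(samples, l2r_logprob, r2l_logprob):
    """Monte-Carlo estimate of KL(p_L2R || p_R2L) for a fixed source:
    E_{y ~ p_L2R}[ log p_L2R(y|x) - log p_R2L(y|x) ].
    samples: translations drawn from the L2R model;
    l2r_logprob/r2l_logprob: callables returning sequence
    log-probabilities under each direction."""
    gaps = [l2r_logprob(y) - r2l_logprob(y) for y in samples]
    return sum(gaps) / len(gaps)
```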
System Combination and Re-ranking
- combine n-best hypotheses from all systems and train a re-ranker using k-best MIRA (a margin-based classification algorithm)
- features used for re-ranking: original system score, 5-gram LM score, R2L score, Target2Source system re-score, cross-lingual sentence similarity between source and hypothesis
- the best feature set turned out to be original system score, LM score, R2L score, R2L sentence vector similarity, and Target2Source sentence similarity (a re-ranking sketch follows this list)
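A minimal sketch of the final scoring step once k-best MIRA has tuned the feature weights; the weight values and feature names here are purely illustrative:

```python
def rerank(hypotheses, weights):
    """hypotheses: list of (translation, feature_dict) pooled from all
    systems; returns the hypothesis with the best weighted feature sum."""
    def score(feats):
        return sum(weights[name] * value for name, value in feats.items())
    return max(hypotheses, key=lambda h: score(h[1]))

# illustrative weights; in the paper, k-best MIRA tunes these on a dev set
weights = {"system_score": 1.0, "lm_score": 0.4, "r2l_score": 0.6,
           "r2l_sentvec_sim": 0.3, "t2s_sim": 0.3}
```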
NMT Pipeline
- Train ZhEn and EnZh Transformer models using DUL and DSL on the bilingual corpus (multiple models can be trained for ensembling)
- Generate back-translation corpus using En & Zh monolingual sentences and pre-trained models from previous step
- Train Transformer Model or Deliberation Network with inflated bilingual corpus, use pre-trained model's weight to initialize encoder and first-pass decoder of Deliberation Network
Experiments - Benchmark on WMT17
- Data
- WMT17 EnZh 18M bilingual pairs. newsdev2017 as dev, newstest2017 as test set.
- use an LM trained on the 18M bilingual pairs to filter monolingual sentences from news.crawl and common.crawl
- Vocab
- Byte Pair Encoding (BPE) of Zh 44k, En 33k
- Model
- Transformer Big implemented with open-source Tensor2Tensor v1.3.0
- 8x M40 GPUs
- 200k steps of Adam with learning_rate 0.3, decayed with the noam schedule (sketched after this list)
- 5,120 words per batch, checkpoints created every 60 min
- results are reported on averaged parameters of last 20 checkpoints
- beam=8, length_penalty=1.0
- reported score using sacreBLEU v1.2.3
- BLEU Score
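For reference, the noam schedule mentioned above is the warmup-then-inverse-square-root decay from the original Transformer paper; d_model=1024 matches Transformer Big, but the warmup value below is a common default rather than a confirmed setting from this paper:

```python
def noam_lr(step, d_model=1024, warmup=8000, scale=0.3):
    """Noam schedule: linear warmup, then inverse-square-root decay.
    scale=0.3 mirrors the learning_rate above; warmup is assumed."""
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```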
Experiment on Larger Corpus
- Data
- WMT17 18M + 35M/50M subset selected from 100M UN corpus
- use cross-entropy selection (Moore and Lewis 2010; Axelrod et al. 2011) (a selection sketch follows this section)
- use SentVect similarity filtering described in the Data Selection section above, with threshold 0.2
- Vocab : same
- Model
- Transformer Big with 8,192 hidden_size in conv-1 block (bigger than original Transformer Big)
- 300k steps of Adam
- minibatch of 3,500 with 8 GPUs
- same beam, length_penalty and averaging param
- BLEU Score
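Cross-entropy difference selection (referenced in the data bullet above) ranks general-domain sentences by how much more probable an in-domain LM finds them than a general LM. This sketch assumes per-word cross-entropy callables for the two LMs; the keep fraction loosely mirrors the 35M-of-100M subset and is illustrative only.

```python
def select_in_domain(sentences, h_in_domain, h_general, keep_fraction=0.35):
    """Cross-entropy difference selection (Moore and Lewis 2010):
    score each sentence by H_in(s) - H_gen(s), where lower means more
    in-domain-like, and keep the best-scoring fraction.
    h_in_domain/h_general: stand-ins for per-word cross-entropy under
    the in-domain and general language models."""
    ranked = sorted(sentences, key=lambda s: h_in_domain(s) - h_general(s))
    return ranked[:int(len(ranked) * keep_fraction)]
```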
Human Evaluation Results
- Ensembles (Combo-4,5,6) achieve human parity (score equivalent to Reference-HT)
- Reference-HT : human translations produced without using online translation engines
- Reference-PE : human post-edits of Google Translate output
- Reference-WMT : original newstest2017 reference released after WMT17
- Online-A-1710 : Microsoft Translator output collected in Oct 2017
- Online-B-1710 : Google Translate output collected in Oct 2017
Evaluation Campaigns
- Motivated to resolve issues with human evaluation processes
- Annotator variability : what if the same annotator gives different results on the same data? Resolved by running three campaigns on the same evaluation data, which showed near-complete overlap
- Data variability : conduct evaluations on completely different subsets of the test data (though the test data itself may already be biased)
Human Analysis
Personal Thoughts
- Complete NMT workflow from data selection up to human evaluation and error analysis
- impressed by the extensive experiments, though the methods are biased toward MS's own ideas
- a human evaluation comparing the baseline and improved models would have been interesting, to see how the error types were reduced
Link : https://www.microsoft.com/en-us/research/uploads/prod/2018/03/final-achieving-human.pdf
Authors : Hassan et al. 2018