Achieving Human Parity on Automatic Chinese to English News Translation
Abstract
- reports that the quality of Microsoft's Chinese-to-English machine translation on news sentences is at human parity
- uses WMT2017 news translation data
- defines how to accurately measure human parity in translation
- describes the workflow and various experiments
Details
Defining Human Parity
- official definition :
If there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations then the machine has achieved human parity
- Evaluation method (a parity-check sketch follows this section)
- use direct assessment described in WMT17
- use source-based evaluation methodology described in IWSLT17
- annotators are shown the source text and a candidate translation and asked "How accurately does the above candidate text convey the semantics of the source text?", answering with a slider ranging from 0 to 100 (100 being perfect)
- to identify unreliable crowd workers, direct assessment randomly mixes in artificially degraded translation output
- use Neural Machine Translation
- LSTM, ConvS2S, and Transformer are all SoTA models; Transformer is chosen as the baseline
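As a rough illustration of the parity criterion, here is a minimal sketch in Python: machine output counts as parity if its direct-assessment scores are not significantly different from the human translations' scores. The choice of scipy's Mann-Whitney U test and the alpha level are my assumptions; the paper's actual procedure (significance clustering across systems) is more involved.

```python
from scipy.stats import mannwhitneyu

def at_human_parity(machine_scores, human_scores, alpha=0.05):
    """machine_scores/human_scores: lists of 0-100 direct-assessment
    annotations for the same test set. Parity = no statistically
    significant difference between the two score samples."""
    _, p_value = mannwhitneyu(machine_scores, human_scores,
                              alternative="two-sided")
    return p_value >= alpha  # no significant difference => parity
```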
Main Contributions
- Main Techniques used to achieve human parity
- Careful data selection and filtering
- Dual Learning to utilize the duality of the translation problem
- Iterative joint training algorithm described in Zhang et al. 2018 to enhance the effect of monolingual data via back-translation
- Deliberation Network to refine translation based on two-pass decoding
- New training objective based on KL divergence to encourage agreement between left-to-right and right-to-left translation
- System Combination and Re-ranking
Data Selection and Filtering
- Learn a bilingual sentence vector representation mapped into the same space to filter the noisy data and select relevant data
- use the method of Zoph et al. 2016 on a subset of data known to be of good quality and relevant domain
- use RNN enc-dec similar to GNMT as base model for representation learning
- use cosine similarity of the sentence representations of source S and target T
- remove sentence pairs with similarity below a specified threshold
- Rule-based filtering
- both source and target sentences must contain at least 3 and at most 70 words
- pairs where src_len > 1.3 * tgt_len or tgt_len > 1.3 * src_len (length ratio above 1.3) are removed
- sentences with illegal characters (URLs, characters of other languages) are removed
- Chinese sentences without any Chinese characters are removed
- duplicated sentence pairs are removed (a combined filtering sketch follows this list)
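A combined sketch of the filters above, assuming sentence vectors from the bilingual encoder are already computed. The helper itself is hypothetical, not the paper's code, and the 0.2 threshold is borrowed from the larger-corpus experiment later in these notes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)

def keep_pair(src_tokens, tgt_tokens, src_vec, tgt_vec, threshold=0.2):
    """Rule-based filters plus sentence-vector similarity.
    src_tokens is the Chinese side; src_vec/tgt_vec are assumed to come
    from the bilingual encoder described above (not implemented here)."""
    s_len, t_len = len(src_tokens), len(tgt_tokens)
    if not (3 <= s_len <= 70 and 3 <= t_len <= 70):
        return False                                    # length bounds
    if s_len > 1.3 * t_len or t_len > 1.3 * s_len:
        return False                                    # length-ratio filter
    if any("http" in tok for tok in src_tokens + tgt_tokens):
        return False                                    # crude URL filter
    if not any("\u4e00" <= ch <= "\u9fff" for ch in "".join(src_tokens)):
        return False                                    # Chinese side must contain Chinese chars
    return cosine(src_vec, tgt_vec) >= threshold        # similarity filter
```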
Dual Learning
- Dual Unsupervised Learning uses the reconstruction log-likelihood of a monolingual corpus during training
- Dual Supervised Learning trains the primal and dual models simultaneously under a regularizer that encourages duality in their probability distributions (a sketch of the regularizer follows)
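A sketch of the DSL regularizer's form (Xia et al. 2017), with the marginal log-probabilities assumed to come from pre-trained monolingual language models:

```python
def dsl_regularizer(log_p_x, log_p_y, log_p_y_given_x, log_p_x_given_y):
    """Dual supervised learning regularizer: both factorizations of
    log P(x, y) should agree, so penalize the squared gap between
    log P(x) + log P(y|x) and log P(y) + log P(x|y).
    log_p_x / log_p_y are assumed to come from monolingual LMs."""
    gap = (log_p_x + log_p_y_given_x) - (log_p_y + log_p_x_given_y)
    return gap ** 2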
Iterative Joint Training
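A hypothetical sketch of the alternating loop from Zhang et al. 2018: each direction back-translates monolingual data for the other, and both models are retrained on real plus synthetic pairs. `train` and `translate` are stand-in stubs, not the paper's code.

```python
def train(model, pairs):
    """Stand-in for one round of NMT training on (src, tgt) pairs."""
    return model

def translate(model, sentences):
    """Stand-in for batch decoding with the given model."""
    return ["<hyp for: %s>" % s for s in sentences]

def iterative_joint_training(zh2en, en2zh, bitext, mono_zh, mono_en, rounds=3):
    """bitext: list of (zh, en) pairs; mono_*: monolingual sentences.
    Each round, every direction back-translates fresh synthetic data
    for the other, then both models retrain on the enlarged corpora."""
    for _ in range(rounds):
        # synthetic (zh, en) pairs for Zh->En: back-translate English monolingual data
        synth_zh_en = list(zip(translate(en2zh, mono_en), mono_en))
        # synthetic (en, zh) pairs for En->Zh: back-translate Chinese monolingual data
        synth_en_zh = list(zip(translate(zh2en, mono_zh), mono_zh))
        zh2en = train(zh2en, bitext + synth_zh_en)
        en2zh = train(en2zh, [(en, zh) for zh, en in bitext] + synth_en_zh)
    return zh2en, en2zh
```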
Deliberation Network
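A minimal sketch of the two-pass decoding idea behind the deliberation network: the second-pass decoder sees both the source and the first-pass draft. All components here are stand-in stubs, not the paper's architecture code.

```python
def deliberation_decode(src, encode, first_pass, second_pass):
    """encode/first_pass/second_pass are stand-ins for the shared
    encoder and the two decoders of the deliberation network."""
    src_states = encode(src)                      # encode the source once
    draft, draft_states = first_pass(src_states)  # pass 1: produce a draft translation
    return second_pass(src_states, draft_states)  # pass 2: refine with a global view of the draft
```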
L2R, R2L Agreement Regularization
- signals from R2L model can be leveraged to alleviate the exposure bias problem of L2R model and vice versa
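One way to picture the agreement term is a Monte-Carlo estimate of KL(p_L2R || p_R2L) over sampled translations; the paper's exact objective differs in its details, so treat this as an illustrative stand-in:

```python
def kl_agreement(samples, l2r_logprob, r2l_logprob):
    """Monte-Carlo estimate of KL(p_L2R || p_R2L) for a fixed source:
    E_{y ~ p_L2R}[ log p_L2R(y|x) - log p_R2L(y|x) ].
    samples: translations drawn from the L2R model;
    l2r_logprob/r2l_logprob: callables returning sequence
    log-probabilities under each direction."""
    gaps = [l2r_logprob(y) - r2l_logprob(y) for y in samples]
    return sum(gaps) / len(gaps)
```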
System Combination and Re-ranking
- combine n-best hypotheses from all systems and train a re-ranker using k-best MIRA (a margin-based classification algorithm)
- features used for re-ranking: original system score, 5-gram LM score, R2L score, Target2Source system re-score, cross-lingual sentence similarity between source and hypothesis
- the best feature set turned out to be original system score, LM score, R2L score, R2L sentence vector similarity, and Target2Source sentence similarity (a re-ranking sketch follows this list)
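A minimal sketch of the final scoring step once k-best MIRA has tuned the feature weights; the weight values and feature names here are purely illustrative:

```python
def rerank(hypotheses, weights):
    """hypotheses: list of (translation, feature_dict) pooled from all
    systems; returns the hypothesis with the best weighted feature sum."""
    def score(feats):
        return sum(weights[name] * value for name, value in feats.items())
    return max(hypotheses, key=lambda h: score(h[1]))

# illustrative weights; in the paper, k-best MIRA tunes these on a dev set
weights = {"system_score": 1.0, "lm_score": 0.4, "r2l_score": 0.6,
           "r2l_sentvec_sim": 0.3, "t2s_sim": 0.3}
```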
NMT Pipeline
- Train ZhEn and EnZh Transformer models using DUL and DSL on the bilingual corpus (multiple models can be trained for ensembling)
- Generate back-translation corpus using En & Zh monolingual sentences and pre-trained models from previous step
- Train Transformer Model or Deliberation Network with inflated bilingual corpus, use pre-trained model's weight to initialize encoder and first-pass decoder of Deliberation Network
Experiments - Benchmark on WMT17
- Data
- WMT17 EnZh 18M bilingual pairs. newsdev2017 as dev, newstest2017 as test set.
- use an LM trained on the 18M bilingual pairs to filter monolingual sentences from news.crawl and common.crawl
- Vocab
- Byte Pair Encoding (BPE) of Zh 44k, En 33k
- Model
- Transformer Big implemented with open-source Tensor2Tensor v1.3.0
- 8x M40 GPUs
- 200k steps of Adam with learning_rate 0.3, decayed with the noam schedule (sketched after this list)
- 5,120 words per batch, checkpoints created every 60 min
- results are reported on averaged parameters of last 20 checkpoints
- beam=8, length_penalty=1.0
- reported score using sacreBLEU v1.2.3
- BLEU Score
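For reference, the noam schedule mentioned above is the warmup-then-inverse-square-root decay from the original Transformer paper; d_model=1024 matches Transformer Big, but the warmup value below is a common default rather than a confirmed setting from this paper:

```python
def noam_lr(step, d_model=1024, warmup=8000, scale=0.3):
    """Noam schedule: linear warmup, then inverse-square-root decay.
    scale=0.3 mirrors the learning_rate above; warmup is assumed."""
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```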
Experiment on Larger Corpus
- Data
- WMT17 18M + 35M/50M subset selected from 100M UN corpus
- use cross-entropy selection (Moore and Lewis 2010; Axelrod et al. 2011) (a selection sketch follows this section)
- use SentVect similarity filtering described in the Data Selection section above, with threshold 0.2
- Vocab : same
- Model
- Transformer Big with 8,192 hidden_size in conv-1 block (bigger than original Transformer Big)
- 300k steps of Adam
- minibatch of 3,500 with 8 GPUs
- same beam, length_penalty and averaging param
- BLEU Score
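Cross-entropy difference selection (referenced in the data bullet above) ranks general-domain sentences by how much more probable an in-domain LM finds them than a general LM. This sketch assumes per-word cross-entropy callables for the two LMs; the keep fraction loosely mirrors the 35M-of-100M subset and is illustrative only.

```python
def select_in_domain(sentences, h_in_domain, h_general, keep_fraction=0.35):
    """Cross-entropy difference selection (Moore and Lewis 2010):
    score each sentence by H_in(s) - H_gen(s), where lower means more
    in-domain-like, and keep the best-scoring fraction.
    h_in_domain/h_general: stand-ins for per-word cross-entropy under
    the in-domain and general language models."""
    ranked = sorted(sentences, key=lambda s: h_in_domain(s) - h_general(s))
    return ranked[:int(len(ranked) * keep_fraction)]
```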
Human Evaluation Results
- Ensembles (Combo-4,5,6) achieve human parity (score equivalent to Reference-HT)
- Reference-HT : human translations produced without using online translation engines
- Reference-PE : human post-edits of Google Translate output
- Reference-WMT : original newstest2017 reference released after WMT17
- Online-A-1710 : Microsoft Translator output collected in Oct 2017
- Online-B-1710 : Google Translate output collected in Oct 2017
Evaluation Campaigns
- Motivated to resolve issues with human evaluation processes
- Annotator variability : what if the same annotator gives different results on the same data? Resolved by running three campaigns on the same evaluation data, which showed near-complete overlap
- Data variability : conduct evaluations on completely different subsets of the test data (though the test data itself may already be biased)
Human Analysis
Personal Thoughts
- Complete NMT workflow from data selection up to human evaluation and error analysis
- impressed by the extensive experiments, though the methods are biased toward MS's own ideas
- a human evaluation comparing the baseline and improved models would have been interesting, to see how the error types were reduced
Link : https://www.microsoft.com/en-us/research/uploads/prod/2018/03/final-achieving-human.pdf
Authors : Hassan et al. 2018