Word Translation without Parallel Data
Abstract
- propose a training framework that builds a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way
- outperforms existing supervised methods on cross-lingual tasks for some language pairs
Details
Introduction
- Word Embedding
  - `word2vec` was proposed by Mikolov et al. 2013a for learning distributed representations of words in an unsupervised manner
  - Levy & Goldberg 2014 showed that the skip-gram with negative sampling method of `word2vec` amounts to factorizing a word-context co-occurrence matrix whose entries are the pointwise mutual information of the respective word and context pairs
- Cross-lingual Embedding w/ Parallel Vocab
  - the initial study on cross-lingual embedding, Mikolov et al. 2013b, noticed that continuous word embedding spaces exhibit similar structure across languages, and proposed to learn a linear mapping from the source to the target embedding space using a parallel vocabulary of 5k words as anchor points (see the sketch after this list)
- Cross-lingual Embedding w/o Parallel Vocab
  - Smith et al. 2017 employ identical character strings to form a parallel vocab, which limits the method to languages sharing a common alphabet
  - Cao et al. 2016 employ a distribution-based approach
  - Zhang et al. 2017b employ adversarial training
  - these approaches sound appealing, but their performance is significantly below that of supervised methods
- Contributions
  - propose learning SoTA cross-lingual embeddings without a parallel vocab, evaluated on three tasks: word translation, sentence translation retrieval, and cross-lingual word similarity
  - introduce a cross-domain similarity adaptation method which significantly improves the unsupervised method by solving the `hubness` problem (points tending to be nearest neighbors of many points in high-dimensional space)
  - propose an unsupervised criterion that is highly correlated with the quality of the cross-lingual mapping, which can be used for early stopping and hyperparameter tuning
  - release high-quality dictionaries for 12 oriented language pairs and open-source the code
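Not from the paper, but a minimal numpy sketch of the anchor-point idea above, assuming a seed dictionary of aligned word pairs (the random arrays stand in for real embeddings):

```python
import numpy as np

# Stand-ins for n anchor pairs of d-dim embeddings: X[i] is the source
# vector and Y[i] the target vector of the i-th seed dictionary entry.
rng = np.random.default_rng(0)
n, d = 5000, 300
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, d))

# Least-squares fit of W minimizing ||XW - Y||_F^2, as in Mikolov et al. 2013b.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def translate(x, target_vecs):
    """Map a source vector with W, return the index of the nearest
    target row under cosine similarity."""
    mapped = x @ W
    sims = (target_vecs @ mapped) / (
        np.linalg.norm(target_vecs, axis=1) * np.linalg.norm(mapped))
    return int(np.argmax(sims))
```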
Method
- Word Embedding
  - learn unsupervised word embeddings using fastText (300-dim) for the source and target languages (see the fastText sketch after this list)
- Adversarial Training
  - train a discriminator to distinguish mapped source embeddings `WX` from target embeddings `Y`, while the mapping `W` is trained to fool the discriminator (see the adversarial sketch after this list)
- Refinement Procedure
  - GAN gives good performance, but not on par with supervised methods, because rare words hinder the overall performance
  - to refine, build a synthetic parallel vocab on the fly using the `W` learned in GAN training
  - choose the most frequent words and retain only mutual nearest neighbors to ensure a high-quality dictionary
  - apply the Procrustes solution on this generated dictionary for refinement, iteratively (see the Procrustes sketch after this list)
- Cross-Domain Similarity Local Scaling (CSLS)
  - a retrieval similarity that penalizes hubs in high-dimensional space (see the CSLS sketch after this list)
- Unsupervised Criterion
  - average similarity between frequent source words and their CSLS translations, used for early stopping and hyperparameter tuning (see the sketch after this list)
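For the word-embedding step, a minimal sketch using the `fasttext` Python package (the corpus path is a placeholder; each language is trained separately on its own monolingual corpus):

```python
import fasttext

# Train 300-dim skip-gram embeddings on a monolingual corpus;
# "corpus.en.txt" is a placeholder path, one file per language.
model = fasttext.train_unsupervised("corpus.en.txt", model="skipgram", dim=300)
model.save_model("vectors.en.bin")
print(model.get_word_vector("hello").shape)  # (300,)
```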
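For the adversarial step, a rough PyTorch sketch of one training iteration as I understand it: the discriminator learns to tell mapped source embeddings `WX` from target embeddings `Y`, the mapping `W` learns to fool it, and `W` is pulled back toward the orthogonal manifold. The network sizes, learning rates, and batch here are illustrative, not the paper's exact hyperparameters; only the orthogonalization update with beta = 0.01 is taken from the paper.

```python
import torch
import torch.nn as nn

d, batch = 300, 32
W = nn.Linear(d, d, bias=False)                      # the mapping to learn
D = nn.Sequential(nn.Linear(d, 2048), nn.LeakyReLU(0.2),
                  nn.Linear(2048, 1), nn.Sigmoid())  # the discriminator
opt_W = torch.optim.SGD(W.parameters(), lr=0.1)
opt_D = torch.optim.SGD(D.parameters(), lr=0.1)
bce = nn.BCELoss()

# Stand-ins for minibatches of source / target word embeddings.
x = torch.randn(batch, d)
y = torch.randn(batch, d)

# (1) Discriminator step: mapped source labeled 0, real target labeled 1.
pred = torch.cat([D(W(x).detach()), D(y)]).squeeze(1)
gold = torch.cat([torch.zeros(batch), torch.ones(batch)])
loss_D = bce(pred, gold)
opt_D.zero_grad(); loss_D.backward(); opt_D.step()

# (2) Mapping step: update W so the discriminator is fooled (label flipped).
loss_W = bce(D(W(x)).squeeze(1), torch.ones(batch))
opt_W.zero_grad(); loss_W.backward(); opt_W.step()

# (3) Keep W approximately orthogonal, as in the paper:
#     W <- (1 + beta) * W - beta * (W W^T) W, with beta = 0.01.
beta = 0.01
with torch.no_grad():
    M = W.weight.clone()
    W.weight.copy_((1 + beta) * M - beta * M @ M.t() @ M)
```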
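For the refinement step, a sketch of the closed-form Procrustes solution on a dictionary of pairs (the random arrays are stand-ins for the mutual-nearest-neighbor dictionary): the orthogonal `W` minimizing ||XW - Y||_F is U V^T from the SVD of X^T Y.

```python
import numpy as np

def procrustes(X, Y):
    """Closed-form orthogonal W minimizing ||XW - Y||_F, where rows of
    X and Y are the embeddings of the dictionary pairs."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Stand-ins for the synthetic dictionary of mutual nearest neighbors.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 300))
Y = rng.standard_normal((5000, 300))
W = procrustes(X, Y)
assert np.allclose(W @ W.T, np.eye(300), atol=1e-6)  # W is orthogonal
```

The iteration then alternates: map the source space with the new `W`, rebuild the mutual-nearest-neighbor dictionary, and re-solve Procrustes.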
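For CSLS, a numpy sketch of the similarity the paper uses at retrieval time: CSLS(x, y) = 2 * cos(x, y) - r_T(x) - r_S(y), where r_T and r_S are the mean cosines to the K nearest neighbors in the other space (K = 10 in the paper); array names are illustrative.

```python
import numpy as np

def csls(mapped_src, tgt, k=10):
    """CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y), rows are embeddings.
    r_T / r_S are mean cosines to the k nearest neighbors in the other space."""
    # Normalize rows so dot products are cosine similarities.
    s = mapped_src / np.linalg.norm(mapped_src, axis=1, keepdims=True)
    t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = s @ t.T                                       # (n_src, n_tgt)
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # r_T per source word
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # r_S per target word
    return 2 * sims - r_src[:, None] - r_tgt[None, :]

rng = np.random.default_rng(0)
scores = csls(rng.standard_normal((100, 300)), rng.standard_normal((200, 300)))
translations = scores.argmax(axis=1)  # CSLS nearest neighbor per source word
```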
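And a short sketch of the unsupervised criterion as I read it: take the most frequent source words, pick each word's CSLS translation, and average the cosine similarity of those pairs. It reuses `csls()` from the previous sketch and assumes embedding rows are sorted by word frequency; `n_eval = 10_000` follows the paper's 10k most frequent words.

```python
import numpy as np

def validation_criterion(mapped_src, tgt, n_eval=10_000, k=10):
    """Mean cosine similarity between the n_eval most frequent source
    words (rows assumed frequency-sorted) and their CSLS translations.
    Assumes csls() from the previous sketch is in scope."""
    s = mapped_src[:n_eval]
    picks = csls(s, tgt, k=k).argmax(axis=1)      # CSLS translation ids
    s_n = s / np.linalg.norm(s, axis=1, keepdims=True)
    t_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return float((s_n * t_n[picks]).sum(axis=1).mean())
```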
Experiments
- Word Translation
- Sentence Retrieval
Personal Thoughts
- the engineering effort to push the performance of the unsupervised method to surpass that of supervised methods is impressive.
- still, a word is a word and a sentence is a sentence. I'd like to see how this cross-lingual word embedding can be related to sentence-level context
PDF presented at OpenNMT Workshop Paris 2018
Link : https://arxiv.org/pdf/1710.04087.pdf
Authors : Conneau et al. 2018