
Word Translation without Parallel Data


Abstract

  • propose a training framework that builds a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way
  • outperforms existing supervised methods on cross-lingual tasks for some language pairs

Details

Introduction

  • Word Embedding
    • word2vec was proposed by Mikolov et al 2013a for learning distributed representations of words in an unsupervised manner
    • Levy & Goldberg 2014 showed that the skip-gram with negative sampling method of word2vec amounts to factorizing a word-context co-occurrence matrix whose entries are the point-wise mutual information of the respective word and context pairs
  • Cross-lingual Embedding w Parallel vocab
    • the initial study on cross-lingual embeddings was Mikolov et al 2013b, who noticed that continuous word embedding spaces exhibit similar structure across languages and proposed to learn a linear mapping from the source to the target embedding space, using a parallel vocabulary of 5k words as anchor points
  • Cross-lingual Embedding w/o Parallel vocab
    • Smith et al 2017 employ identical character strings to form a parallel vocabulary, which limits the approach to languages sharing a common alphabet
    • Cao et al 2016 employ a distribution-based approach
    • Zhang et al 2017b employ adversarial training
    • the above approaches sound appealing, but their performance remains significantly below that of supervised methods
  • Contributions
    • propose learning SoTA cross-lingual embeddings without a parallel vocabulary, evaluated on three tasks: word translation, sentence translation retrieval, and cross-lingual word similarity
    • introduce a cross-domain similarity adaptation method which significantly improves the unsupervised method by mitigating the hubness problem (points that tend to be nearest neighbors of many other points in high-dimensional space)
    • propose an unsupervised criterion that is highly correlated with the quality of the cross-lingual mapping, which can be used for early stopping and hyperparameter tuning
    • release high-quality dictionaries for 12 oriented language pairs and open-source the code

Method

(figure from the paper: toy illustration of the overall approach)

  • Word Embedding
    • learn unsupervised word embeddings with fastText (300-dim) for the source and target languages
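A minimal sketch of how the monolingual embeddings could be trained with the fastText Python bindings; the corpus paths and query words are placeholders:

```python
import fasttext

# Train 300-dim skip-gram embeddings independently per language
# ("corpus.en.txt" / "corpus.es.txt" are placeholder monolingual corpora).
src_model = fasttext.train_unsupervised("corpus.en.txt", model="skipgram", dim=300)
tgt_model = fasttext.train_unsupervised("corpus.es.txt", model="skipgram", dim=300)

src_vec = src_model.get_word_vector("hello")  # 300-dim numpy vector
tgt_vec = tgt_model.get_word_vector("hola")
```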
  • Adversarial Training
    • discriminator: 2-layer fully-connected network with hidden size 2048 and Leaky-ReLU activations
    • train a GAN in which the discriminator tries to detect whether an embedding comes from the mapped source space or the target space, while the mapping W (the generator) tries to fool the discriminator:
      $\mathcal{L}_D(\theta_D \mid W) = -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}(\mathrm{source}=1 \mid W x_i) - \frac{1}{m}\sum_{i=1}^{m} \log P_{\theta_D}(\mathrm{source}=0 \mid y_i)$
      $\mathcal{L}_W(W \mid \theta_D) = -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}(\mathrm{source}=0 \mid W x_i) - \frac{1}{m}\sum_{i=1}^{m} \log P_{\theta_D}(\mathrm{source}=1 \mid y_i)$
    • W is kept approximately orthogonal with the update rule below, as sketched in the training step that follows:
      $W \leftarrow (1 + \beta) W - \beta (W W^{\top}) W$
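A minimal PyTorch sketch of one adversarial step, assuming `src_batch` and `tgt_batch` are mini-batches of pre-trained monolingual embeddings; the mapping loss keeps only the label-flipping term on mapped source embeddings, and beta = 0.01 is the value reported in the paper:

```python
import torch
import torch.nn as nn

d, beta = 300, 0.01
W = nn.Linear(d, d, bias=False)               # the mapping (generator)
disc = nn.Sequential(                          # discriminator: 2 hidden layers of 2048
    nn.Linear(d, 2048), nn.LeakyReLU(0.2),
    nn.Linear(2048, 2048), nn.LeakyReLU(0.2),
    nn.Linear(2048, 1), nn.Sigmoid(),
)
bce = nn.BCELoss()
opt_d = torch.optim.SGD(disc.parameters(), lr=0.1)
opt_w = torch.optim.SGD(W.parameters(), lr=0.1)

def adversarial_step(src_batch, tgt_batch):
    # 1) discriminator: mapped source -> label 1, target -> label 0
    p_src, p_tgt = disc(W(src_batch).detach()), disc(tgt_batch)
    loss_d = bce(p_src, torch.ones_like(p_src)) + bce(p_tgt, torch.zeros_like(p_tgt))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) mapping W tries to fool the discriminator (flipped label on mapped source)
    p_src = disc(W(src_batch))
    loss_w = bce(p_src, torch.zeros_like(p_src))
    opt_w.zero_grad(); loss_w.backward(); opt_w.step()

    # 3) orthogonality update: W <- (1 + beta) W - beta (W W^T) W
    with torch.no_grad():
        M = W.weight
        M.copy_((1 + beta) * M - beta * M @ M.t() @ M)
```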
  • Refinement Procedure
    • adversarial training alone gives good performance, but not on par with supervised methods, because rare words hinder the quality of the learned mapping
    • to refine, build a synthetic parallel vocabulary on the fly using the W learned during GAN training: take the most frequent words and retain only mutual nearest neighbors to ensure a high-quality dictionary
    • iteratively apply the Procrustes solution to this generated dictionary for refinement, as sketched below
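A minimal numpy sketch of the Procrustes step on a synthetic dictionary, where `X` and `Y` hold the paired source and target embeddings as rows; the mutual-nearest-neighbor dictionary building is omitted here (see the CSLS sketch further below):

```python
import numpy as np

def procrustes(X, Y):
    """Closed-form orthogonal mapping W minimizing ||W X^T - Y^T||_F,
    i.e. W = U V^T where U S V^T is the SVD of Y^T X.
    X, Y: (n_pairs, dim) arrays of paired source/target embeddings."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# one refinement iteration: solve for W, re-map the source space,
# rebuild the mutual-NN dictionary with the new W, and repeat
```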
  • Cross-Domain Similarity Local Scaling (CSLS)
    • to resolve the hubness problem, consider a bi-partite neighborhood graph and define the mean similarity of a mapped source embedding to its K nearest target neighbors as:
      $r_T(W x_s) = \frac{1}{K} \sum_{y_t \in \mathcal{N}_T(W x_s)} \cos(W x_s, y_t)$ (and analogously $r_S(y_t)$ for target words)
    • the overall similarity (CSLS) is then defined as
      $\mathrm{CSLS}(W x_s, y_t) = 2 \cos(W x_s, y_t) - r_T(W x_s) - r_S(y_t)$
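A minimal numpy sketch of CSLS scoring between all mapped source words and all target words, assuming rows are L2-normalized so dot products are cosine similarities:

```python
import numpy as np

def csls_scores(mapped_src, tgt, k=10):
    """CSLS(Wx_s, y_t) = 2 cos(Wx_s, y_t) - r_T(Wx_s) - r_S(y_t).
    mapped_src: (n_src, d), tgt: (n_tgt, d), rows L2-normalized."""
    cos = mapped_src @ tgt.T                            # pairwise cosine similarities
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)   # r_T: mean sim to K NNs in target space
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)   # r_S: mean sim to K NNs among mapped sources
    return 2 * cos - r_src[:, None] - r_tgt[None, :]
```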
  • Unsupervised Criterion
    • consider the 10k most frequent source words, use CSLS to generate a translation for each of them, compute the average cosine similarity between each word and its translation, and use this average as the validation metric
    • this criterion correlates better with performance on the evaluation task than the Wasserstein distance does
      (figure: unsupervised validation criterion vs. word translation accuracy over training epochs)
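A minimal sketch of the validation criterion, reusing `csls_scores` from above and assuming `mapped_src` is ordered by word frequency with L2-normalized rows:

```python
import numpy as np

def unsupervised_criterion(mapped_src, tgt, n_words=10_000, k=10):
    """Average cosine similarity between the most frequent source words
    and their CSLS-retrieved translations."""
    scores = csls_scores(mapped_src[:n_words], tgt, k=k)
    best = scores.argmax(axis=1)                         # CSLS-best target index per source word
    return float((mapped_src[:n_words] * tgt[best]).sum(axis=1).mean())
```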

Experiments

  • Word Translation
    • applying Procrustes with CSLS retrieval in a supervised setting outperforms other supervised methods
    • the unsupervised method proposed in this paper outperforms SoTA in P@1
    • when the word embeddings are trained on Wikipedia (richer embeddings), performance improves further
      (table: word translation retrieval P@1 results)
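A minimal sketch of the P@1 word-translation evaluation with CSLS retrieval; `gold_pairs` is a hypothetical list of (source index, target index) dictionary entries, and a single gold target per source word is assumed for simplicity:

```python
import numpy as np

def precision_at_1(mapped_src, tgt, gold_pairs, k=10):
    """Fraction of source words whose CSLS-nearest target word matches the gold entry."""
    scores = csls_scores(mapped_src, tgt, k=k)           # reuses the CSLS sketch above
    pred = scores.argmax(axis=1)
    return float(np.mean([pred[s] == t for s, t in gold_pairs]))
```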
  • Sentence Retrieval
    • both the supervised and unsupervised methods achieve SoTA
      (table: sentence translation retrieval P@1 results)

Personal Thoughts

  • the engineering effort to push the performance of the unsupervised method past that of supervised methods is impressive.
  • still, a word is a word and a sentence is a sentence; I'd like to see how this cross-lingual word embedding can be related to sentence-level context

Presented at the OpenNMT Workshop, Paris 2018
Link : https://arxiv.org/pdf/1710.04087.pdf
Authors : Conneau et al. 2018