This repo implements Generalized Procrustes Analysis (GPA) for (weakly) supervised bilingual dictionary induction on top of the original MUSE implementation. The approach is described in:
Yova Kementchedjhieva, Sebastian Ruder, Ryan Cotterell and Anders Søgaard. Generalizing Procrustes Analysis for Better Bilingual Dictionary Induction. CoNLL 2018.
To use GPA, run supervised.py with --generalized True and pass the languages as follows: set --src_lang and --src_emb to the name of, and the path to, the main source language; set --tgt_lang and --tgt_emb to the names of, and the paths to, all other languages to train on, with the main target language coming last (format each list as a space-separated string). Example command for training EN to AF with support from DE:
python supervised.py --src_lang en --tgt_lang "de af" --src_emb data/wiki.en.vec --tgt_emb "data/wiki.de.vec data/wiki.af.vec" --generalized True --n_refinement 10 --dico_train identical_char --dico_max_rank 15000
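The argument layout in the example command above can also be assembled programmatically. The following is a minimal sketch, not part of this repo: the helper name build_gpa_args and its (lang, path) tuples are hypothetical, and only the flags already shown above are used. It assumes that language names and paths contain no spaces, since each list is passed as one space-separated string.

```python
def build_gpa_args(src, support, tgt):
    """Build the supervised.py argument list for a GPA run.

    src, tgt: (lang, embedding_path) tuples for the main source and target.
    support:  list of (lang, embedding_path) tuples for support languages.
    Hypothetical helper; it only mirrors the flags shown in the README.
    """
    targets = list(support) + [tgt]  # the main target language must come last
    return [
        "python", "supervised.py",
        "--src_lang", src[0],
        "--tgt_lang", " ".join(lang for lang, _ in targets),
        "--src_emb", src[1],
        "--tgt_emb", " ".join(path for _, path in targets),
        "--generalized", "True",
    ]
```

Passing the resulting list to subprocess.run keeps each space-separated list as a single argument, which matches the quoting used in the command above.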
With --generalized False, simple Procrustes Analysis runs, as originally implemented in the MUSE code.
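For intuition about the difference between the two modes, here is a minimal NumPy sketch of textbook Generalized Procrustes Analysis. It is an illustration under stated assumptions, not the code in supervised.py: the function names are hypothetical, every input matrix is assumed to hold the embeddings of the same dictionary entries in the same row order, and the training procedure in this repo may differ in its details.

```python
import numpy as np


def procrustes(X, Y):
    """Return the orthogonal W that minimizes ||X @ W - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def generalized_procrustes(spaces, n_iter=10):
    """Align several row-aligned (n_pairs, dim) matrices to a shared mean.

    Returns one orthogonal map per input space and the final mean shape.
    """
    reference = spaces[0]  # start from the first space as the reference
    maps = []
    for _ in range(n_iter):
        # Rotate every space onto the current reference ...
        maps = [procrustes(X, reference) for X in spaces]
        # ... then update the reference to the mean of the rotated spaces.
        reference = np.mean([X @ W for X, W in zip(spaces, maps)], axis=0)
    return maps, reference
```

With exactly two spaces and a single iteration this reduces to ordinary Procrustes Analysis of the second space onto the first.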
Dependency: PyTorch 0.4
Below is the original README for MUSE.
MUSE is a Python library for multilingual word embeddings, whose goal is to provide the community with:
- state-of-the-art multilingual word embeddings based on fastText
- large-scale high-quality bilingual dictionaries for training and evaluation
We include two methods, one supervised that uses a bilingual dictionary or identical character strings, and one unsupervised that does not use any parallel data (see Word Translation without Parallel Data for more details).
- Python 2/3 with NumPy/SciPy
- PyTorch
- Faiss (recommended) for fast nearest neighbor search (CPU or GPU).
MUSE is available on CPU or GPU, in Python 2 or 3. Faiss is optional for GPU users - though Faiss-GPU will greatly speed up nearest neighbor search - and highly recommended for CPU users. Faiss can be installed using "conda install faiss-cpu -c pytorch" or "conda install faiss-gpu -c pytorch".
Get monolingual and cross-lingual word embeddings evaluation datasets:
- Our 110 bilingual dictionaries
- 28 monolingual word similarity tasks for 6 languages, and the English word analogy task
- Cross-lingual word similarity tasks from SemEval2017
- Sentence translation retrieval with Europarl corpora
by simply running (in data/):
./get_evaluation.sh
Note: Requires bash 4. The download of Europarl is disabled by default because it is slow; you can enable it in get_evaluation.sh.
For pre-trained monolingual word embeddings, we highly recommend fastText Wikipedia embeddings, or using fastText to train your own word embeddings from your corpus.
You can download the English (en) and Spanish (es) embeddings this way:
# English fastText Wikipedia embeddings
curl -Lo data/wiki.en.vec https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.vec
# Spanish fastText Wikipedia embeddings
curl -Lo data/wiki.es.vec https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.es.vec
This project includes two ways to obtain cross-lingual word embeddings:
- Supervised: using a train bilingual dictionary (or identical character strings as anchor points), learn a mapping from the source to the target space using (iterative) Procrustes alignment.
- Unsupervised: without any parallel data or anchor point, learn a mapping from the source to the target space using adversarial training and (iterative) Procrustes refinement.
For more details on these approaches, please see Word Translation Without Parallel Data [1].
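The (iterative) Procrustes step named in both approaches can be sketched in a few lines of NumPy. This is a simplified illustration, not the MUSE implementation: the function names are hypothetical, and the refinement loop rebuilds the dictionary from plain cosine nearest neighbors over the whole vocabulary, which is an assumption made here for brevity.

```python
import numpy as np


def procrustes(X, Y):
    """Return the orthogonal W that minimizes ||X @ W - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def iterative_procrustes(src, tgt, seed_pairs, n_refinement=5):
    """Learn a source-to-target map from seed pairs, then refine it.

    src, tgt:   (n_words, dim) embedding matrices.
    seed_pairs: integer array of shape (n_pairs, 2) with (src_idx, tgt_idx).
    """
    # Work with unit-length rows so dot products are cosine similarities.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    W = procrustes(src[seed_pairs[:, 0]], tgt[seed_pairs[:, 1]])
    for _ in range(n_refinement):
        # Rebuild the dictionary: nearest target word for every source word.
        nearest = ((src @ W) @ tgt.T).argmax(axis=1)
        # Re-solve Procrustes on the induced dictionary.
        W = procrustes(src, tgt[nearest])
    return W
```

The orthogonality constraint keeps distances within the source space unchanged, so the mapping only rotates (or reflects) the source embeddings.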
To learn a mapping between the source and the target space, simply run:
python supervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5 --dico_train default
By default, dico_train points to our ground-truth dictionaries (downloaded above); when set to "identical_char", it uses the character strings that are identical in the source and target vocabularies as the training dictionary. Logs and embeddings are saved in the dumped/ directory.
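The idea behind "identical_char" can be sketched as follows. This is a minimal illustration with a hypothetical function name, assuming each vocabulary is a plain list of words ordered as in the embedding file; the repo's own dictionary builder may filter or rank the pairs differently.

```python
def identical_char_pairs(src_words, tgt_words):
    """Pair up words spelled identically in both vocabularies.

    Returns a list of (src_index, tgt_index) tuples in source order.
    """
    tgt_index = {word: j for j, word in enumerate(tgt_words)}
    return [
        (i, tgt_index[word])
        for i, word in enumerate(src_words)
        if word in tgt_index
    ]
```

Shared strings such as names and numbers give weak supervision without any bilingual resource, which is why this setting is described as (weakly) supervised.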
To learn a mapping using adversarial training and iterative Procrustes refinement, run:
python unsupervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5
By default, the validation metric is the mean cosine of word pairs from a synthetic dictionary built with CSLS (Cross-domain similarity local scaling). For some language pairs (e.g. En-Zh), we recommend centering the embeddings using --normalize_embeddings center.
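As a rough illustration of the validation metric and of centering, here is a NumPy sketch. It is a simplification under stated assumptions, not the MUSE code: the function names and the default k=10 are chosen here for illustration, every source word is paired with its best CSLS match, and both matrices are assumed to be already mapped into the same space.

```python
import numpy as np


def center(emb):
    """Subtract the mean vector so the embeddings are centered at the origin."""
    return emb - emb.mean(axis=0, keepdims=True)


def unit_rows(emb):
    """Scale every row to unit length."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)


def csls_scores(src, tgt, k=10):
    """CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y) for all pairs."""
    cos = unit_rows(src) @ unit_rows(tgt).T  # (n_src, n_tgt) cosines
    # Mean cosine of each source word to its k nearest target words.
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)
    # Mean cosine of each target word to its k nearest source words.
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)
    return 2 * cos - r_src[:, None] - r_tgt[None, :]


def mean_cosine_of_csls_pairs(src, tgt, k=10):
    """Mean cosine over the pairs (x, best CSLS match of x)."""
    best = csls_scores(src, tgt, k).argmax(axis=1)
    cos = np.sum(unit_rows(src) * unit_rows(tgt)[best], axis=1)
    return float(cos.mean())
```

Subtracting the two neighborhood terms penalizes "hub" words that are close to many others, which plain nearest-neighbor retrieval tends to over-select.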
We also include a simple script to evaluate the quality of monolingual or cross-lingual word embeddings on several tasks:
Monolingual
python evaluate.py --src_lang en --src_emb data/wiki.en.vec --max_vocab 200000
Cross-lingual
python evaluate.py --src_lang en --tgt_lang es --src_emb data/wiki.en-es.en.vec --tgt_emb data/wiki.en-es.es.vec --max_vocab 200000
By default, the aligned embeddings are exported to a text format at the end of experiments: --export txt. Exporting embeddings to a text file can take a while if you have a lot of embeddings. For a very fast export, you can set --export pth to export the embeddings in a PyTorch binary file, or simply disable the export (--export "").
When loading embeddings, the model can load:
- PyTorch binary files previously generated by MUSE (.pth files)
- fastText binary files previously generated by fastText (.bin files)
- text files (text file with one word embedding per line)
The first two options are very fast and can load 1 million embeddings in a few seconds, while loading text files can take a while.
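For reference, a text file with one word embedding per line can be read with a short loop like the one below. This is a sketch with a hypothetical function name, not the MUSE loader; it assumes space-separated values and an optional first line of the form "<count> <dim>", as found in fastText .vec files.

```python
import numpy as np


def load_text_embeddings(lines, max_vocab=200000):
    """Read embeddings from an iterable of text lines.

    Each line holds a word followed by its vector values, separated by
    spaces. Returns the word list and a (n_words, dim) float32 matrix.
    """
    words, vectors = [], []
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        if i == 0 and len(parts) == 2:
            continue  # skip the "<count> <dim>" header line
        if len(words) >= max_vocab:
            break
        words.append(parts[0])
        vectors.append(np.asarray([float(v) for v in parts[1:]], dtype=np.float32))
    return words, np.vstack(vectors)
```

An open file object can be passed directly as lines. Parsing every value in Python is the reason text files load much more slowly than the binary formats.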
We provide multilingual embeddings and ground-truth bilingual dictionaries.
We release fastText Wikipedia supervised word embeddings for 30 languages, aligned in a single vector space.
Arabic: text | Bulgarian: text | Catalan: text | Croatian: text | Czech: text | Danish: text |
Dutch: text | English: text | Estonian: text | Finnish: text | French: text | German: text |
Greek: text | Hebrew: text | Hungarian: text | Indonesian: text | Italian: text | Macedonian: text |
Norwegian: text | Polish: text | Portuguese: text | Romanian: text | Russian: text | Slovak: text |
Slovenian: text | Spanish: text | Swedish: text | Turkish: text | Ukrainian: text | Vietnamese: text |
You can visualize crosslingual nearest neighbors using demo.ipynb.
We created 110 large-scale ground-truth bilingual dictionaries using an internal translation tool. The dictionaries handle word polysemy well. We provide a train and test split of 5000 and 1500 unique source words, as well as a larger set of up to 100k pairs. Our goal is to ease the development and evaluation of cross-lingual word embeddings and multilingual NLP.
European languages in every direction
src-tgt | German | English | Spanish | French | Italian | Portuguese |
---|---|---|---|---|---|---|
German | - | full train test | full train test | full train test | full train test | full train test |
English | full train test | - | full train test | full train test | full train test | full train test |
Spanish | full train test | full train test | - | full train test | full train test | full train test |
French | full train test | full train test | full train test | - | full train test | full train test |
Italian | full train test | full train test | full train test | full train test | - | full train test |
Portuguese | full train test | full train test | full train test | full train test | full train test | - |
Other languages to English (e.g. {fr,es}-en)
English to other languages (e.g. en-{fr,es})
Please cite [1] if you found the resources in this repository useful.
[1] A. Conneau*, G. Lample*, MA. Ranzato, L. Denoyer, H. Jégou, Word Translation Without Parallel Data
* Equal contribution. Order has been determined with a coin flip.
@article{conneau2017word,
title={Word Translation Without Parallel Data},
author={Conneau, Alexis and Lample, Guillaume and Ranzato, Marc'Aurelio and Denoyer, Ludovic and J{\'e}gou, Herv{\'e}},
journal={arXiv preprint arXiv:1710.04087},
year={2017}
}
MUSE is the project at the origin of the work on unsupervised machine translation with monolingual data only [2].
[2] G. Lample, L. Denoyer, MA. Ranzato, Unsupervised Machine Translation Using Monolingual Corpora Only
@article{lample2017unsupervised,
title={Unsupervised Machine Translation Using Monolingual Corpora Only},
author={Lample, Guillaume and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
journal={arXiv preprint arXiv:1711.00043},
year={2017}
}
Contact: gl@fb.com aconneau@fb.com