This code implements a simple beam search in which cross-lingual word embeddings are combined with a language model. It is compatible with MUSE embeddings and kenlm language models. The output translation can be further fed to a denoising autoencoder for improved reordering.
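The core idea can be sketched as follows: at each source position, nearest-neighbor target words under the cross-lingual embeddings are proposed, and partial hypotheses are reranked by an interpolation of embedding similarity and language-model score. This is a minimal illustrative sketch with toy two-dimensional embeddings and a dummy stand-in for the kenlm score; the vocabulary, vectors, and `lm_weight` interpolation are assumptions for illustration, not the repository's actual implementation.

```python
import numpy as np

# Toy cross-lingual embeddings (hypothetical, for illustration only).
# In practice these would be MUSE source/target embeddings in a shared space.
src_emb = {"haus": np.array([1.0, 0.0]), "katze": np.array([0.0, 1.0])}
tgt_vocab = ["house", "cat", "dog"]
tgt_emb = np.array([[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]])

def lm_score(hypothesis):
    # Stand-in for a kenlm language-model score (log probability);
    # here just a dummy that mildly prefers shorter hypotheses.
    return -0.1 * len(hypothesis)

def translate(src_sentence, beam_size=2, lm_weight=0.5):
    beams = [([], 0.0)]  # (partial translation, accumulated embedding score)
    for src_word in src_sentence:
        q = src_emb[src_word]
        # Cosine similarity against all target embeddings
        # (Faiss would replace this brute-force search at scale).
        sims = tgt_emb @ q / (np.linalg.norm(tgt_emb, axis=1) * np.linalg.norm(q))
        candidates = []
        for hyp, emb_score in beams:
            for idx in np.argsort(-sims)[:beam_size]:
                candidates.append((hyp + [tgt_vocab[idx]], emb_score + sims[idx]))
        # Rerank by interpolated embedding + LM score, keep the best beams.
        candidates.sort(key=lambda c: c[1] + lm_weight * lm_score(c[0]), reverse=True)
        beams = candidates[:beam_size]
    return beams[0][0]

print(translate(["haus", "katze"]))  # -> ['house', 'cat']
```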
If you use this code, please cite:
- Yunsu Kim, Jiahui Geng and Hermann Ney. Improving Unsupervised Word-by-Word Translation Using Language Model and Denoising Autoencoder. EMNLP 2018.
- Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer and Hervé Jégou. Word Translation Without Parallel Data. arXiv preprint.
If you are looking for the denoising autoencoder, please go to sockeye-noise.
First, please install all dependencies:
- Python 2/3 with NumPy/SciPy
- PyTorch
- Faiss (recommended) for fast nearest neighbor search (CPU or GPU).
- kenlm (with Python bindings)
Then clone this repository.
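The dependencies above can typically be installed with pip; the exact package names below (in particular `faiss-cpu` and the kenlm archive URL) are assumptions that may differ on your system, so treat this as a setup sketch rather than an authoritative recipe.

```shell
# Hypothetical setup commands; adjust package names and versions as needed.
pip install numpy scipy
pip install torch
pip install faiss-cpu   # or faiss-gpu for GPU nearest-neighbor search
# kenlm with Python bindings, installed directly from the upstream repository:
pip install https://github.com/kpu/kenlm/archive/master.zip
```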
Here is a simple example for translation:
> cat {input_corpus} | python translate.py --src_emb {source_embedding} \
--tgt_emb {target_embedding} \
--emb_dim {embedding_dimension} \
--lm {language_model} > {output_translation}
Please refer to the help message (-h) for other detailed options.
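MUSE embeddings are distributed in the word2vec text format: an optional header line with the vocabulary size and dimension, followed by one `word v1 ... vd` line per word. If you need to inspect or preprocess an embedding file before passing it via --src_emb/--tgt_emb, a loader might look like the sketch below; the function name and `max_vocab` cutoff are illustrative assumptions, not part of this repository's API.

```python
import io
import numpy as np

def load_embeddings(path, emb_dim, max_vocab=200000):
    """Load embeddings in word2vec text format (as used by MUSE):
    an optional 'count dim' header, then one 'word v1 ... vd' line per word."""
    words, vectors = [], []
    with io.open(path, "r", encoding="utf-8", errors="ignore") as f:
        for i, line in enumerate(f):
            parts = line.rstrip().split(" ")
            if i == 0 and len(parts) == 2:
                continue  # header line: vocabulary size and dimension
            if len(parts) != emb_dim + 1:
                continue  # skip malformed lines
            words.append(parts[0])
            vectors.append(np.array(parts[1:], dtype=np.float32))
            if len(words) >= max_vocab:
                break
    return words, np.stack(vectors)
```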