This is a library for using our multilingual representations. The details are described in the paper Inducing Language-Agnostic Multilingual Representations.
CURRENT VERSION:
- We provide the modifications to m-BERT and XLM-R, namely joint-align, space normalization, and the removal of word contractions (the other two text normalizations will be added in the next version); see the sketch after this list for an illustration of space normalization.
- We provide machine-translated datasets for the RFEval and XNLI tasks.
- The pre-trained m-BERT and XLM-R models are re-aligned on a word-translation corpus extracted from Europarl and JW300, covering 19 languages: English, German, Portuguese, Dutch, Indonesian, Italian, French, Spanish, Hungarian, Afrikaans, Malay, Tagalog, Javanese, Bengali, Marathi, Estonian, Hindi, Urdu, Finnish.
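As a rough illustration of what space normalization does, the sketch below mean-centers each language's sentence embeddings so that the two spaces share a common origin. This is a minimal sketch of the general technique under our assumptions, not the library's exact implementation; the function and variable names are ours.

```python
import numpy as np

def center_embeddings(embeddings: np.ndarray) -> np.ndarray:
    """Mean-center a (num_sentences, hidden_dim) embedding matrix.

    Subtracting the per-language mean removes the language-specific
    offset of the embedding space, which is the intuition behind
    space normalization (a sketch, not the library's exact code).
    """
    return embeddings - embeddings.mean(axis=0, keepdims=True)

# Hypothetical usage: embeddings extracted from m-BERT or XLM-R
# for English and German sentences, respectively.
en_embeddings = np.random.randn(100, 768)  # placeholder data
de_embeddings = np.random.randn(100, 768)  # placeholder data
en_centered = center_embeddings(en_embeddings)
de_centered = center_embeddings(de_embeddings)
```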
- Python 3.6
- PyTorch, tested with 1.3.1
- NumPy, tested with 1.18.4
- spacy-udpipe, tested with 0.1.0
- pyemd, a fast earth mover's distance implementation, tested with 0.5.1
- Transformers, tested with 2.7.0
- Mosestokenizer, tokenization scripts from the Moses toolkit, tested with 1.0.0
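Assuming the standard PyPI package names (our assumption; the repo may ship its own requirements file), the dependencies can be installed with pip:

```bash
# Package names assumed from PyPI; version pins match the tested versions above.
pip install torch==1.3.1 numpy==1.18.4 spacy-udpipe==0.1.0 \
    pyemd==0.5.1 transformers==2.7.0 mosestokenizer==1.0.0
```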
The models are available at the link. Google Drive downloads are awkward to script from the shell, so please download them manually and move them to the `model` folder. The RFEval datasets are already included in the repo. Please download the XNLI datasets from the link and move them to the `XNLI/dataset` folder.
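After downloading, the layout should look roughly like this (a sketch inferred from the folder names above, not an authoritative listing):

```
model/            # downloaded m-BERT / XLM-R checkpoints go here
RFEval/           # RFEval datasets ship with the repo
XNLI/
└── dataset/      # downloaded XNLI data goes here
```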
For RFEval, we provide experiments for m-BERT and XLM-R that compare the results of the three modifications in one go (a sketch of the pyemd-based earth mover's distance scoring follows the commands below). For XNLI, you need to fine-tune the models on XNLI before running the experiments; the fine-tuned models are not released in this version.
```bash
cd [TASK]
python [MODEL]_eval.py
```
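The pyemd dependency suggests that scoring compares contextualized embeddings via earth mover's distance. Below is a minimal sketch of that general pattern, comparing two bags of word embeddings with pyemd; the function, the uniform weighting, and the placeholder data are our own illustration, not the repo's evaluation code.

```python
import numpy as np
from pyemd import emd

def emd_between_bags(src_vecs: np.ndarray, hyp_vecs: np.ndarray) -> float:
    """Earth mover's distance between two bags of word embeddings.

    A sketch of the general technique: both bags are stacked into one
    joint point set, each side's mass is spread uniformly over its own
    points, and pyemd solves the transport problem over the pairwise
    Euclidean distance matrix.
    """
    points = np.vstack([src_vecs, hyp_vecs]).astype(np.float64)
    n_src, n_hyp = len(src_vecs), len(hyp_vecs)
    n = n_src + n_hyp

    # Uniform mass on each side of the joint point set.
    src_hist = np.zeros(n, dtype=np.float64)
    hyp_hist = np.zeros(n, dtype=np.float64)
    src_hist[:n_src] = 1.0 / n_src
    hyp_hist[n_src:] = 1.0 / n_hyp

    # Pairwise Euclidean distances over the joint point set.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))

    return emd(src_hist, hyp_hist, dist)

# Hypothetical usage with random placeholder embeddings.
src = np.random.randn(5, 768)
hyp = np.random.randn(7, 768)
print(emd_between_bags(src, hyp))
```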
Modifications | m-BERT (fr-en) | XLM-R (fr-en) | m-BERT (de-en) | XLM-R (de-en) |
---|---|---|---|---|
original | 38.2 | 16.8 | 28.3 | 14.8 |
norm_space | 42.4 | 43.7 | 31.8 | 35.5 |
norm_text | 42.7 | 15.0 | 30.1 | 14.3 |
align | 46.1 | 27.6 | 37.0 | 19.3 |
align + norm_space | 47.6 | 47.6 | 39.4 | 36.9 |
align + norm_space + norm_text | 52.0 | 47.7 | 40.8 | 36.8 |
We observe that space normalization achieves the most consistent gains of the three modifications across encoders, tasks, and languages. We also see that combining the three modifications often brings considerable additive gains.
This project is Apache-licensed, as found in the LICENSE
file in the root directory of this source tree.
@article{zhao2020inducing,
    title = "Inducing Language-Agnostic Multilingual Representations",
    author = "Zhao, Wei and Eger, Steffen and Bjerva, Johannes and Augenstein, Isabelle",
    journal = "arXiv preprint arXiv:2008.09112",
    year = "2020"
}
@inproceedings{zhao-etal-2020-limitations,
title = "On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation",
author = "Zhao, Wei and Glava{\v{s}}, Goran and Peyrard, Maxime and Gao, Yang and West, Robert and Eger, Steffen",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.151",
pages = "1656--1671"
}