ECNMT: Emergent Communication Pretraining for Few-Shot Machine Translation

This repository is the official PyTorch implementation of the following paper:

Yaoyiran Li, Edoardo Maria Ponti, Ivan Vulić, and Anna Korhonen. 2020. Emergent Communication Pretraining for Few-Shot Machine Translation. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020). LINK

This method is a form of unsupervised knowledge transfer in the absence of linguistic data, where a model is first pre-trained on artificial languages emerging from referential games and then fine-tuned on few-shot downstream tasks like neural machine translation.

[Figure: Emergent Communication and Machine Translation]

Dependencies

  • PyTorch 1.3.1
  • Python 3.6
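
To confirm that a local environment matches the versions listed above, a small check along the following lines may help. It is a minimal sketch, not part of the repository: the function names `version_mismatches` and `installed_torch_version` are hypothetical, and it only compares the installed versions against the two listed here.

```python
import sys
from importlib import util

# Versions listed in the Dependencies section of this README.
EXPECTED_PYTHON = (3, 6)
EXPECTED_TORCH = "1.3.1"


def version_mismatches(python_version, torch_version):
    """Return one note per version that differs from the listed ones.

    python_version: a tuple such as sys.version_info
    torch_version:  a version string, or None if PyTorch is not installed
    """
    notes = []
    major_minor = tuple(python_version[:2])
    if major_minor != EXPECTED_PYTHON:
        notes.append("Python %d.%d found, README lists 3.6" % major_minor)
    if torch_version is None:
        notes.append("PyTorch is not installed, README lists 1.3.1")
    elif str(torch_version).split("+")[0] != EXPECTED_TORCH:
        # Builds such as "1.3.1+cu100" carry a local suffix after "+".
        notes.append("PyTorch %s found, README lists 1.3.1" % torch_version)
    return notes


def installed_torch_version():
    """Return the installed PyTorch version string, or None if absent."""
    if util.find_spec("torch") is None:
        return None
    import torch
    return str(torch.__version__)


if __name__ == "__main__":
    for note in version_mismatches(sys.version_info, installed_torch_version()):
        print("warning:", note)
```

The check prints nothing when both versions match. Other versions may also work, but only the two listed above are stated here.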

Data

COCO image features are available in the sub-folder half_feats here. Preprocessed EN-DE (DE-EN) data for translation are available in the sub-folder task1 here. Both are obtained from Translagent.

Translation data for the other language pairs (EN-CS, EN-RO, EN-FR) are available at the links below.

Dictionaries  | Train Sentence Pairs | Reference Translations
EN-CS & CS-EN | EN-CS & CS-EN        | EN-CS & CS-EN
EN-RO & RO-EN | EN-RO & RO-EN        | EN-RO & RO-EN
EN-FR & FR-EN | EN-FR & FR-EN        | EN-FR & FR-EN

Pretrained Models for Emergent Communication

Source / Target | Target / Source
EN              | DE
EN              | CS
EN              | RO
EN              | FR

Experiments

Step 1: run EC pretraining (or skip to Step 2 and use one of the pretrained models listed above).

cd ./ECPRETRAIN
sh run_training.sh

Step 2: run NMT fine-tuning (before running, set the paths to the training data, the pretrained model, and the save directory).

cd ./NMT
sh run_training.sh

Optional: run the baseline.

cd ./BASELINENMT
sh run_training.sh
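
The steps above can also be chained from the repository root with a small driver script. The following is a minimal sketch, not part of the repository: it assumes the three folders sit directly under the repository root and that each `run_training.sh` is launched from inside its own folder, as in the commands above. The names `plan_steps` and `run_steps` are hypothetical. The paths for Step 2 still have to be set by hand beforehand.

```python
import subprocess
from pathlib import Path


def plan_steps(run_pretraining=True, run_baseline=False):
    """Return the folders whose run_training.sh should be run, in order."""
    steps = []
    if run_pretraining:
        steps.append("ECPRETRAIN")   # Step 1: EC pretraining
    steps.append("NMT")              # Step 2: NMT fine-tuning
    if run_baseline:
        steps.append("BASELINENMT")  # Optional: baseline
    return steps


def run_steps(repo_root, steps):
    """Run `sh run_training.sh` inside each folder, stopping on the first failure."""
    for folder in steps:
        workdir = Path(repo_root) / folder
        # check=True raises CalledProcessError if the script exits non-zero,
        # so a later step never starts after an earlier one has failed.
        subprocess.run(["sh", "run_training.sh"], cwd=str(workdir), check=True)
```

For example, `run_steps(".", plan_steps(run_pretraining=False))` called from the repository root would skip Step 1 and run only the fine-tuning script, which fits the case where a pretrained model is used.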

Citation

@inproceedings{YL:2020,
  author    = {Yaoyiran Li and Edoardo Maria Ponti and Ivan Vulić and Anna Korhonen},
  title     = {Emergent Communication Pretraining for Few-Shot Machine Translation},
  year      = {2020},
  booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
}

Acknowledgements

Part of the code is based on Translagent.

The datasets for our experiments are MS COCO for Emergent Communication pretraining, and Multi30k Task 1 and Europarl for NMT fine-tuning. Text preprocessing is based on Moses and Subword-NMT.

Please cite these resources accordingly.