/yue_nmt

Python scripts and datasets of the "Extremely Low-Resource Neural Machine Translation: A Case Study of Cantonese" project


Extremely Low-Resource Neural Machine Translation: A Case Study of Cantonese

Result | Data | Parallel Sentence Mining | Model Training

This repo provides the implementation scripts for the project, as well as the synthetic data generated via bitext mining. You can find the paper here.

The development of NLP applications for Cantonese, a language with over 85 million speakers, lags behind that of other languages with a similar number of speakers. This project is, to the best of my knowledge, the first benchmark of multiple neural machine translation (NMT) systems for Cantonese. Secondly, I performed parallel sentence mining as data augmentation for the extremely low-resource language pair (Cantonese-Mandarin) and increased the number of sentence pairs by 3480% (from 1,002 to 35,877). Results show that with the parallel sentence mining technique, the best-performing model (BPE-level bidirectional LSTM) scored 11.98 BLEU higher than the vanilla baseline and 9.93 BLEU higher than my strong baseline. Thirdly, I evaluated the quality of the translations on both modern and historical texts to investigate the models' ability to translate historical texts. Finally, I provide the first large-scale parallel training dataset for the language pair (post-sentence mining), as well as an evaluation dataset comprising Cantonese, Mandarin, and Literary Chinese for future research.

Key implementations in the project

  1. Data augmentation via Parallel Sentence Mining (PSM)
  2. NMT model training
    1. Bidirectional LSTM (BiLSTM) MT
      • word representation
      • BPE representation (highest BLEU score)
    2. Transformer MT
      • word representation
      • BPE representation (best translation quality)
    3. Unsupervised NMT via Language Model Pre-training and Transfer Learning

*Note: The script for fine-tuning mBART can be found here; however, this approach failed on the unseen language (Cantonese) and resulted in a BLEU score of 0.

Result

| Model | SacreBLEU |
| --- | --- |
| BiLSTM (Vanilla baseline) | 1.24 |
| BiLSTM_t (Strong baseline) | 3.29 |
| BiLSTM_t +PSM | 12.37 |
| BiLSTM_bpe +PSM | 13.22 |
| Transformer_word +PSM | 3.56 |
| Transformer_bpe +PSM | 11.66 |
| RELM_adap +PSM | 1.85 |
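
The gains quoted in the introduction can be re-derived from the scores listed above and the dataset sizes. The short check below uses only numbers reported in this README; the variable names are my own and are not part of the repo.

```python
# SacreBLEU scores copied from the results listed in this README.
scores = {
    "vanilla_baseline": 1.24,
    "strong_baseline": 3.29,
    "bilstm_bpe_psm": 13.22,
}

# Sentence-pair counts before and after parallel sentence mining.
pairs_before, pairs_after = 1_002, 35_877

# Absolute BLEU gains of the best model over each baseline.
gain_vs_vanilla = round(scores["bilstm_bpe_psm"] - scores["vanilla_baseline"], 2)
gain_vs_strong = round(scores["bilstm_bpe_psm"] - scores["strong_baseline"], 2)

# Relative growth of the training data, in percent.
pct_increase = (pairs_after - pairs_before) / pairs_before * 100

print(f"Gain over vanilla baseline: {gain_vs_vanilla} BLEU")
print(f"Gain over strong baseline: {gain_vs_strong} BLEU")
print(f"Increase in sentence pairs: {pct_increase:.1f}%")
```

The percentage is computed relative to the original 1,002 pairs, which is why it comes out at roughly 3480% rather than the 35.8x ratio of the two sizes.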

Pretrained models

Parallel Sentence Mining

The scripts for Parallel Sentence Mining (PSM), also known as bitext mining, can be found here. They perform PSM on Wikipedia backup files, concatenate the UD data and the synthetic dataset, and finally generate a pickle file of the combined dataset.
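
The final concatenate-and-pickle step can be pictured with a minimal sketch. This is not the repository's script: the helper name `combine_and_pickle`, the `(Cantonese, Mandarin)` tuple layout, the output file name, and the toy sentences are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def combine_and_pickle(ud_pairs, mined_pairs, out_path):
    """Concatenate two lists of sentence pairs and save the result as a pickle.

    Hypothetical helper; the repo's own script may store the data differently.
    """
    combined = list(ud_pairs) + list(mined_pairs)
    with open(out_path, "wb") as f:
        pickle.dump(combined, f)
    return combined


# Toy (Cantonese, Mandarin) pairs standing in for the UD corpus and the mined data.
ud_pairs = [("你食咗飯未？", "你吃飯了嗎？")]
mined_pairs = [
    ("我哋聽日去邊度？", "我們明天去哪裡？"),
    ("佢係我朋友。", "他是我朋友。"),
]

out_file = Path(tempfile.gettempdir()) / "combined_pairs_demo.pkl"
combined = combine_and_pickle(ud_pairs, mined_pairs, out_file)

# Reload the pickle to confirm the round trip preserves the pairs.
with open(out_file, "rb") as f:
    reloaded = pickle.load(f)
print(len(reloaded), "sentence pairs saved to", out_file)
```

Keeping each pair as a tuple keeps source and target aligned through any later shuffling or splitting.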

Data

| Model | Data | Size (Sentence pairs) | Ratio (Train/Validation/Test) |
| --- | --- | --- | --- |
| Baseline | Cantonese and Mandarin Chinese Parallel Corpus (UD) | 1,002 | 80/10/10 |
| Experimental models | UD+PSM | 35,877 | 68/15/17 |
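
To illustrate how a ratio such as 68/15/17 maps onto split sizes, here is a minimal sketch. It is not `split_data.py`: the function name `split_by_ratio`, the fixed seed, and the rounding rule (the remainder goes to the test set) are assumptions, so the exact sizes produced by the repo's script may differ.

```python
import random


def split_by_ratio(items, ratios=(0.68, 0.15, 0.17), seed=0):
    """Shuffle items and cut them into train/validation/test by the given ratios."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_train = int(len(shuffled) * ratios[0])
    n_valid = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder absorbs rounding
    return train, valid, test


# Integers stand in for the 35,877 sentence pairs of the UD+PSM dataset.
train, valid, test = split_by_ratio(range(35_877))
print(len(train), len(valid), len(test))
```

Shuffling before cutting avoids a split in which all UD sentences land in one set and all mined sentences in another.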

NMT Model Training

Preliminary

  1. Clone the current repo
git clone https://github.com/evelynkyl/yue_nmt
  2. Split data into training, validation, and test sets
mkdir /yue_nmt/bitext_mining/data/bitext_and_ud/split
python3 split_data.py
  3. Install dependencies for model training
# apex (for fp16 training)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir ./
cd ~

# fairseq (for machine translation)
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
cd ~

*Note: It is HIGHLY recommended to use half precision (via Apex) by adding the --fp16 True --amp 1 flags to each run command. Without it, you might run out of memory.

Implementation

BiLSTM or Transformer (via JoeyNMT)

Scripts with the model parameters can be found in /yue_nmt/scripts/training. To train a model, run the command below:

python3 -m joeynmt train {config.yaml}

Unsupervised NMT by Transfer Learning (via RELM)

Please refer to UNMT_via_RELM.

Evaluation

BiLSTM or Transformer (via JoeyNMT)

Evaluate the model on the test set. This returns the SacreBLEU scores on the validation and test sets, using the model state that achieved the highest validation score during training.

python3 -m joeynmt test {modelname_config.yaml} --output_path /yue_nmt/models/modelname_predictions

Inference

BiLSTM or Transformer (via JoeyNMT)

# file translation
python3 -m joeynmt translate {modelname_config.yaml} < literary_goldref_zh_bpe.txt --output_path eval_literary_translated_yue.txt

License

Yue_NMT is BSD-licensed, as found in the LICENSE file in the root directory of this source tree.

Acknowledgement