Result | Data | Parallel Sentence Mining | Model Training
This repo provides the implementation scripts for the project, as well as the synthetic data generated via bitext mining. You can find the paper here.
The development of NLP applications for Cantonese, a language with over 85 million speakers, lags behind that of other languages with a similar number of speakers. This project is, to the best of my knowledge, the first benchmark of multiple neural machine translation (NMT) systems for Cantonese. Secondly, I performed parallel sentence mining as data augmentation for the extremely low-resource language pair Cantonese-Mandarin, increasing the number of sentence pairs by 3480% (from 1,002 to 35,877). Results show that with parallel sentence mining, the best-performing model (BPE-level bidirectional LSTM) scored 11.98 BLEU higher than the vanilla baseline and 9.93 BLEU higher than my strong baseline. Thirdly, I evaluated translation quality on both modern and historical texts to investigate the models' ability to translate historical texts. Finally, I provide the first large-scale parallel training dataset for the language pair (post-sentence mining), as well as an evaluation dataset comprising Cantonese, Mandarin, and Literary Chinese for future research.
- Data augmentation via Parallel Sentence Mining (PSM)
- NMT models training
- Bidirectional LSTM (BiLSTM) MT
- word representation
- BPE representation (highest BLEU score)
- Transformer MT
- word representation
- BPE representation (best translation quality)
- Unsupervised NMT via Language Model Pre-training and Transfer Learning
- Bidirectional LSTM (BiLSTM) MT
*Note: The script for fine-tuning mBART can be found here; however, this approach failed on the unseen language (Cantonese) and resulted in a BLEU score of 0.
Model | SacreBLEU |
---|---|
BiLSTM (Vanilla baseline) | 1.24 |
BiLSTM<sub>t</sub> (Strong baseline) | 3.29 |
BiLSTM<sub>t</sub> + PSM | 12.37 |
BiLSTM<sub>bpe</sub> + PSM | 13.22 |
Transformer<sub>word</sub> + PSM | 3.56 |
Transformer<sub>bpe</sub> + PSM | 11.66 |
RELM<sub>adap</sub> + PSM | 1.85 |
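
The BLEU gains quoted in the project summary can be re-derived from the table above. The short sketch below does that arithmetic; the dictionary keys are shorthand labels chosen here for illustration, not identifiers from the repo.

```python
# SacreBLEU scores copied from the results table.
# The keys are illustrative shorthand labels, not names used in the code base.
scores = {
    "BiLSTM (vanilla baseline)": 1.24,
    "BiLSTM-t (strong baseline)": 3.29,
    "BiLSTM-t + PSM": 12.37,
    "BiLSTM-bpe + PSM": 13.22,
    "Transformer-word + PSM": 3.56,
    "Transformer-bpe + PSM": 11.66,
    "RELM-adap + PSM": 1.85,
}

# Pick the best-scoring system and compare it with both baselines.
best_name = max(scores, key=scores.get)
best = scores[best_name]
gain_over_vanilla = round(best - scores["BiLSTM (vanilla baseline)"], 2)
gain_over_strong = round(best - scores["BiLSTM-t (strong baseline)"], 2)

print(best_name, gain_over_vanilla, gain_over_strong)
```

The two differences correspond to the 11.98 and 9.93 BLEU improvements reported in the summary.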
The scripts for Parallel Sentence Mining (PSM), also known as bitext mining, can be found here. They perform PSM on Wikipedia backup files, concatenate the UD data with the synthetic dataset, and finally generate a pickle file of the combined dataset.
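
For readers new to bitext mining, here is a minimal sketch of ratio-margin scoring, the kind of criterion used in LASER-style mining: a candidate pair's cosine similarity is divided by the average similarity of each sentence to its k nearest neighbours on the other side. This is not the code in this repo. The toy vectors stand in for real sentence embeddings, only the forward direction is mined, and the function names, `k`, and `threshold` are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_topk(sims, k):
    """Average of the k largest similarities."""
    top = sorted(sims, reverse=True)[:k]
    return sum(top) / len(top)

def mine_pairs(src_vecs, tgt_vecs, k=2, threshold=1.0):
    """Return (src_idx, tgt_idx, score) for each source sentence whose best
    target candidate has a ratio-margin score above `threshold`."""
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Average similarity of every sentence to its k nearest cross-lingual neighbours.
    src_nn = [mean_topk(row, k) for row in sim]
    tgt_nn = [mean_topk([row[j] for row in sim], k) for j in range(len(tgt_vecs))]

    def margin(i, j):
        return sim[i][j] / ((src_nn[i] + tgt_nn[j]) / 2)

    pairs = []
    for i in range(len(src_vecs)):
        j = max(range(len(tgt_vecs)), key=lambda col: margin(i, col))
        if margin(i, j) > threshold:
            pairs.append((i, j, margin(i, j)))
    return pairs

# Toy "embeddings": two source sentences, three target candidates.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
tgt = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]
pairs = mine_pairs(src, tgt)
print([(i, j) for i, j, _ in pairs])
```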
Model | Data | Size (sentence pairs) | Ratio (Train/Validation/Test) |
---|---|---|---|
Baseline | Cantonese and Mandarin Chinese Parallel Corpus (UD) | 1,002 | 80/10/10 |
Experimental models | UD+PSM | 35,877 | 68/15/17 |
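
To make the split ratios in the table concrete, the sketch below shows one way a 68/15/17 train/validation/test split could be produced: shuffle with a fixed seed, then slice. It is not the repository's `split_data.py`; the function name, the seed, and the placeholder sentence pairs are assumptions.

```python
import random

def split_pairs(pairs, ratios=(0.68, 0.15, 0.17), seed=0):
    """Shuffle sentence pairs and cut them into train/validation/test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * ratios[0])
    n_valid = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder, so no pair is dropped
    return train, valid, test

# Placeholder data: 100 dummy (Cantonese, Mandarin) pairs.
pairs = [(f"yue_{i}", f"cmn_{i}") for i in range(100)]
train, valid, test = split_pairs(pairs)
print(len(train), len(valid), len(test))
```

Assigning the remainder to the test set guarantees that the three parts always add up to the full dataset, even when the ratios do not divide the size evenly.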
- Clone current repo
git clone https://github.com/evelynkyl/yue_nmt
- Split data into training, evaluation, and test sets
mkdir /yue_nmt/bitext_mining/data/bitext_and_ud/split
python3 split_data.py
- Install dependencies for model training
# apex (for fp16 training)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir ./
cd ~
# fairseq (for machine translation)
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
cd ~
*Note: It is HIGHLY recommended to use half precision (via Apex) by adding the --fp16 True --amp 1 flags to each training command. Without it, you might run out of memory.
BiLSTM or Transformer (via JoeyNMT)
The configuration files with the model parameters can be found in /yue_nmt/scripts/training. To train a model, run the command below:
python3 -m joeynmt train {config.yaml}
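
If you want to train several models in a row, the command above can be wrapped in a small driver. The sketch below is an illustration, not part of the repo: the helper names and the `dry_run` switch are assumptions, and it calls the current interpreter (`sys.executable`) in place of `python3`.

```python
import subprocess
import sys
from pathlib import Path

def build_train_command(config_path):
    """Return the JoeyNMT training command for one config file."""
    return [sys.executable, "-m", "joeynmt", "train", str(config_path)]

def train_all(config_dir, dry_run=True):
    """Build (and optionally run) one training command per YAML config in config_dir."""
    commands = [build_train_command(p) for p in sorted(Path(config_dir).glob("*.yaml"))]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the loop as soon as one training run fails.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `train_all("/yue_nmt/scripts/training")` only prints the commands; pass `dry_run=False` to actually start training.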
Unsupervised NMT by Transfer Learning (via RELM)
Please refer to UNMT_via_RELM.
BiLSTM or Transformer (via JoeyNMT)
Evaluate the model on the test set. This returns the SacreBLEU score on the validation and test sets, using the model state that achieved the highest validation score during training.
python3 -m joeynmt test {modelname_config.yaml} --output_path /yue_nmt/models/modelname_predictions
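
For intuition about the reported metric, the sketch below computes a simplified corpus-level BLEU in plain Python. It is not SacreBLEU: it applies no tokenizer and no smoothing, splits on whitespace, and supports a single reference per sentence, so its numbers are not comparable with the results table. The function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU (0-100) over whitespace-tokenized sentences."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero n-gram precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)

# Toy example with space-separated tokens; the hypothesis is one token short.
print(corpus_bleu(["a b c d"], ["a b c d e"]))
```

In the toy example every n-gram of the hypothesis appears in the reference, so the score is lowered only by the brevity penalty.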
BiLSTM or Transformer (via JoeyNMT)
# file translation
python3 -m joeynmt translate {modelname_config.yaml} < literary_goldref_zh_bpe.txt --output_path eval_literary_translated_yue.txt
Yue_NMT is BSD-licensed, as found in the LICENSE file in the root directory of this source tree.
- The UD dataset is downloaded from UD Cantonese based on the Universal Dependencies Project.
- The Literary-Modern Chinese evaluation dataset is manually translated based on Ancient-Modern Chinese Translation with a New Large Training Dataset.
- Our bitext mining code is based on LASER.
- Our unsupervised NMT (RELM) code is based on RELM.
- We used the awesome JoeyNMT for training some of the NMT models. We thank the authors for sharing their great work.