
Laboro ParaCorpus

Introduction

We are happy to announce that we have made our web-based English-Japanese parallel corpus publicly available. More information on how we created the corpus can be found in this article; this document mainly focuses on the implementation.

To reproduce our experiments, please follow these three steps:

  1. selecting candidate domains for web crawling
  2. crawling and alignment to generate the parallel corpus
  3. training and evaluating NMT models to assess the quality of the parallel corpus

In addition, the last part of this document presents a complete comparison of BLEU scores between several NMT models on 7 evaluation datasets. Take a look if you are interested!

Download

Laboro-ParaCorpus

  • Laboro-ParaCorpus (parallel corpus)
  • Base EN-JA model
  • Base JA-EN model
  • Big EN-JA model
  • Big JA-EN model

Laboro-ParaCorpus+

  • Laboro-ParaCorpus+ (parallel corpus)
  • Base EN-JA model
  • Base JA-EN model
  • Big EN-JA model
  • Big JA-EN model

To Cite This Work

We haven't published any paper on this work. Please cite this repository:

@misc{Laboro-ParaCorpus,
  title = {Laboro-ParaCorpus: A Web-Based Japanese-English Parallel Corpus},
  author = {Zhao, Xinyi and Hamamoto, Masafumi and Fujihara, Hiromasa},
  year = {2021},
  howpublished = {\url{https://github.com/laboroai/Laboro-ParaCorpus}}
}

License

CC0

The parallel corpus itself is licensed under a Public Domain CC0 License. You may use the corpus without restriction under copyright or database law, even for commercial use!

Creative Commons License

The NMT models are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
For commercial use, please contact Laboro.AI Inc.

Step 1. Select Candidate Domains

Please refer to the document here.

Step 2. Crawling and Alignment

Requirements

Ubuntu 18.04
Python 3.6.9
Bitextor v7.3.2
Bicleaner v0.14
MeCab 0.996
mecab-ipadic-NEologd

To make Bitextor handle Japanese characters and punctuation better, we made some modifications to its source code. To use the modified code, please run the script ./src/bitextor/set-up-bitextor.sh to replace the original source code in Bitextor.

Necessary Resources

  1. enough storage for crawling and post-processing

How much storage space is enough depends on how many domains you plan to crawl and how much content each domain contains. For reference, we used a bit less than 2 TB of storage for 50,000 domains.

  2. an English-Japanese vocabulary dictionary

We crawled an English-Japanese vocabulary dictionary from several dictionary websites and ended up collecting 82,711 entries. It is important to crawl the dictionary from multiple sources in order to balance the language style, because we want the final corpus to contain a little bit of everything, both academic and casual text. A minimal sketch of how such a dictionary could be merged from multiple crawled sources is shown after this list.

  3. a trained Bicleaner

A detailed explanation of how to train a Bicleaner model can be found here. According to it, two extra parallel corpora are needed: a big corpus to extract a probabilistic dictionary and word-frequency information, and a small but high-quality corpus used as the training corpus. Note that the vocabulary dictionary from item 2, which is used in the alignment step, cannot be reused here, because it doesn't contain the probability and word-frequency information required by the training process.

Similarly, we crawled the big corpus from a number of dictionary websites with bilingual example sentences. As for the small but clean training corpus, we used about 600K sentence pairs from the Reijiro corpus.
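As referenced above, the sketch below illustrates how vocabulary entries crawled from several sources could be merged and deduplicated into a single tab-separated English-Japanese dictionary file. The file layout, the tab-separated format, and the deduplication strategy are assumptions for illustration only; please check the Bitextor documentation for the exact dictionary format it expects.

# merge_dictionary_sketch.py -- a minimal sketch, not part of this repository.
# Assumes each crawled source is a tab-separated file of "english<TAB>japanese" entries.
import glob

def merge_sources(pattern, output_path):
    entries = set()
    for path in glob.glob(pattern):                      # e.g. dict_sources/*.tsv (hypothetical layout)
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) != 2:
                    continue                             # skip malformed lines
                en, ja = parts[0].strip().lower(), parts[1].strip()
                if en and ja:
                    entries.add((en, ja))                # deduplicate across sources
    with open(output_path, "w", encoding="utf-8") as out:
        for en, ja in sorted(entries):
            out.write(f"{en}\t{ja}\n")

if __name__ == "__main__":
    merge_sources("dict_sources/*.tsv", "en-ja.dic")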

Configuration

Bitextor uses the YAML format for its configuration files. By modifying the configuration files, we are able to control the pipeline and select the tools. Detailed instructions can be found on the GitHub homepage of Bitextor.

We provide some examples for generating configuration files in src/gen_config/. To use your own dictionary and Bicleaner model, and to place the output in the proper location, please change the paths in the sample code.

# to generate configuration files that stop the pipeline after crawling
python3 src/gen_config/gen_yaml_only_crawl.py
# to generate configuration files that finish the complete pipeline
python3 src/gen_config/gen_yaml_complete.py
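For reference, the sketch below illustrates what such a generation script might do: loop over a list of crawl targets and write one Bitextor YAML configuration per domain. The field names and directory layout are illustrative assumptions only; the exact keys used by Bitextor v7.3.2 are defined in our scripts in src/gen_config/ and in the Bitextor documentation.

# gen_config_sketch.py -- a minimal sketch, not the actual src/gen_config scripts.
# Field names and paths below are illustrative assumptions; check src/gen_config/ for the real configuration.
import yaml

DOMAINS = ["example.com", "example.jp"]          # hypothetical crawl targets

for i, domain in enumerate(DOMAINS, start=1):
    config = {
        "permanentDir": f"/home/ubuntu/data/permanent/cc_1830_{i:05d}",
        "dataDir": f"/home/ubuntu/data/data/cc_1830_{i:05d}",
        "transientDir": f"/home/ubuntu/data/transient/cc_1830_{i:05d}",
        "lang1": "en",
        "lang2": "ja",
        "hosts": [domain],
        # paths to the vocabulary dictionary and the trained Bicleaner model (assumed locations)
        "dic": "/home/ubuntu/data/dictionary/en-ja.dic",
        "bicleaner": "/home/ubuntu/data/bicleaner/en-ja.yaml",
    }
    with open(f"/home/ubuntu/data/post_conf/cc_1830_{i:05d}.yaml", "w") as f:
        yaml.safe_dump(config, f)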

Start Bitextor Pipeline

To run a single process,

conf_dir='/home/ubuntu/data/post_conf/'
index=00001
/home/ubuntu/bitextor/bitextor.sh -j 1 -s $conf_dir'cc_1830_'$index'.yaml'

We also provide an example of running 330 processes at the same time. Each process has 151 or 152 domains in its queue.

bash src/run_bitextor/multiprocess_run.sh
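The sketch below shows one way the domain list could be divided into 330 roughly equal batches; with 50,000 domains this yields 151 or 152 domains per process. The input and output file names are hypothetical, not files from this repository.

# split_domains_sketch.py -- a minimal sketch of dividing the domain list into 330 batches.
# "domains.txt" and the batch file naming are assumptions, not part of this repository.

NUM_PROCESSES = 330

with open("domains.txt", encoding="utf-8") as f:
    domains = [line.strip() for line in f if line.strip()]

# distribute as evenly as possible: the first (len % 330) batches get one extra domain
base, extra = divmod(len(domains), NUM_PROCESSES)
start = 0
for i in range(NUM_PROCESSES):
    size = base + (1 if i < extra else 0)        # 151 or 152 for 50,000 domains
    batch = domains[start:start + size]
    start += size
    with open(f"domains_batch_{i:03d}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(batch) + "\n")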

The output for each domain will be saved as a file named en-ja.sent.xz in the permanentDir set in the corresponding configuration file. To concatenate them into the final output, please run the following commands.

permanentDir=...
OUTPUT_PATH=...
rm -f $OUTPUT_PATH
# concatenate the per-domain outputs; .xz streams can be concatenated directly
for FILE in ${permanentDir}/en-ja.sent.xz; do
    cat $FILE >> $OUTPUT_PATH
done

The Appending Filter

To further clean the corpus, we appended a strict rule-based filter to the end of the Bitextor pipeline. The filter removes sentence pairs whose URL pairs do not follow these rules:

  1. the URL pair must contain at least one language identifier such as "ja", "en", "=j", etc.;
  2. the numbers in the URLs, if present, are usually the date or a post ID, and are required to be identical within a URL pair.

To use the filter, please run the command below. A simplified sketch of the two rules follows it.

python3 src/url_filter/url_filter.py
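The sketch below illustrates the two rules in simplified form; it is not the actual src/url_filter/url_filter.py, and the exact list of language identifiers used by our filter may differ.

# url_filter_sketch.py -- a simplified sketch of the two URL rules, not the actual filter.
import re

# hypothetical, partial list of language identifiers; the real filter may use a different set
LANG_IDENTIFIERS = ["ja", "en", "=j", "=e"]

def keep_pair(url_en, url_ja):
    # rule 1: the URL pair must contain at least one language identifier
    lowered = (url_en + " " + url_ja).lower()
    if not any(ident in lowered for ident in LANG_IDENTIFIERS):
        return False
    # rule 2: numbers in the URLs (dates, post IDs, ...) must be identical within the pair
    if re.findall(r"\d+", url_en) != re.findall(r"\d+", url_ja):
        return False
    return True

# example: kept, because both URLs carry a language identifier and the same post ID
print(keep_pair("https://example.com/en/post/20210401", "https://example.com/ja/post/20210401"))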

Step 3. Training and Evaluating NMT Models

To evaluate and compare the quality of the parallel corpora, we trained several sets of NMT models. The first set of models is trained on our final corpus (Laboro-ParaCorpus). To explore how much the performance is influenced by an additional corpus, especially a small one, the second set is trained on the combination of our corpus and an NHK daily-conversation corpus (Laboro-ParaCorpus+). The NHK corpus is also crawled from online resources and contains only around 60K sentence pairs. In addition, the third set is trained on the combination of our corpus and NTT's JParaCrawl corpus (Laboro-ParaCorpus-NTT). Each set includes 4 models,

  1. base model, from English to Japanese
  2. base model, from Japanese to English
  3. big model, from English to Japanese
  4. big model, from Japanese to English

Setup & Preparation

We used sentencepiece to train the tokenizers, and Fairseq to train and evaluate NMT models on the parallel corpus we created.

All scripts related to training and evaluating NMT models are placed in the ./nmt/expe1 folder. We recommend creating a new experiment folder every time an NMT model is trained on a new corpus. Please place the original corpus in ./nmt/expe1/corpus/[name_of_the_corpus]/, and then use the following scripts to split it into English and Japanese plain-text files. Laboro-ParaCorpus comes with tokenizers, so you don't have to train them again, although we also provide a script for training sentencepiece tokenizers on your own parallel corpus.

# split the corpus into English and Japanese plain text
bash nmt/expe1/src/preprocess/split_text.sh

# train tokenizers if needed
# bash nmt/expe1/src/preprocess/train_tokenizer.sh
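If you do train your own tokenizers, the sketch below shows the equivalent call through the sentencepiece Python API. The vocabulary size, model type, and character coverage are assumed values and may differ from the settings in train_tokenizer.sh.

# train_tokenizer_sketch.py -- a sketch of sentencepiece training, not train_tokenizer.sh itself.
# Vocabulary size, model type, and character coverage below are assumed values.
import sentencepiece as spm

for lang, coverage in [("en", 1.0), ("ja", 0.9995)]:
    spm.SentencePieceTrainer.train(
        input=f"nmt/expe1/corpus/Laboro-ParaCorpus/Laboro-ParaCorpus.{lang}",
        model_prefix=f"nmt/expe1/tokenizer/spm/spm.{lang}",
        vocab_size=32000,                 # assumed; check train_tokenizer.sh for the real value
        model_type="unigram",
        character_coverage=coverage,      # Japanese usually uses <1.0 to drop very rare characters
    )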

After the steps above, the experiment folder is ready for training NMT models and should contain the files shown below. The original corpus won't be used again for NMT training and evaluation, so it is OK to delete it to save some storage space.

├── corpus
│   └── Laboro-ParaCorpus
│       ├── Laboro-ParaCorpus.en
│       ├── Laboro-ParaCorpus.ja
│       └── Laboro-ParaCorpus.txt
├── tokenizer
│   └── spm
│       ├── spm.en.model
│       ├── spm.en.vocab
│       ├── spm.ja.model
│       └── spm.ja.vocab
└── src
    ├── enja
    │   └── ...
    ├── jaen
    │   └── ...
    └── preprocess
        └── ...

Preprocessing

# tokenize and length filter training dataset
bash nmt/expe1/src/preprocess/preprocess_train_dataset.sh

# generate dummy validation corpus
mkdir -p nmt/expe1/corpus/dummy
echo en > nmt/expe1/corpus/dummy/dummy.en
echo ja > nmt/expe1/corpus/dummy/dummy.ja
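As a rough illustration of what the preprocessing step does, the sketch below tokenizes both sides of the corpus with the trained sentencepiece models and drops pairs whose token length falls outside a chosen range. The length thresholds and the ratio check are assumptions, not necessarily the settings used in preprocess_train_dataset.sh.

# preprocess_sketch.py -- an illustrative sketch of tokenization plus length filtering.
# The thresholds (1-250 tokens, length ratio <= 9) are assumptions, not the repository's actual settings.
import sentencepiece as spm

sp_en = spm.SentencePieceProcessor(model_file="nmt/expe1/tokenizer/spm/spm.en.model")
sp_ja = spm.SentencePieceProcessor(model_file="nmt/expe1/tokenizer/spm/spm.ja.model")

with open("nmt/expe1/corpus/Laboro-ParaCorpus/Laboro-ParaCorpus.en", encoding="utf-8") as f_en, \
     open("nmt/expe1/corpus/Laboro-ParaCorpus/Laboro-ParaCorpus.ja", encoding="utf-8") as f_ja, \
     open("train.spm.en", "w", encoding="utf-8") as out_en, \
     open("train.spm.ja", "w", encoding="utf-8") as out_ja:
    for line_en, line_ja in zip(f_en, f_ja):
        toks_en = sp_en.encode(line_en.strip(), out_type=str)
        toks_ja = sp_ja.encode(line_ja.strip(), out_type=str)
        # drop empty, overly long, or badly unbalanced pairs
        if not (1 <= len(toks_en) <= 250 and 1 <= len(toks_ja) <= 250):
            continue
        if max(len(toks_en), len(toks_ja)) / min(len(toks_en), len(toks_ja)) > 9:
            continue
        out_en.write(" ".join(toks_en) + "\n")
        out_ja.write(" ".join(toks_ja) + "\n")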

Training NMT

To directly use our models for inference, please skip this step. To reproduce our experiments, it is necessary to adjust the --update-freq argument in the scripts ./nmt/expe1/src/[direction]/fairseq_nmt_pretrain_[model-size]_novalid_[direction].sh according to the number of GPUs available.

For pre-training base models,

Number of GPUs    --update-freq
2                 32
4 (default)       16
8                 8

For pre-training big models,

Number of GPUs    --update-freq
2                 80
4 (default)       40
8                 20

These values keep the product of the number of GPUs and --update-freq constant (64 for the base models, 160 for the big models), so the effective batch size does not depend on the number of GPUs.

Then run the following scripts to preprocess the datasets with fairseq and to train the models.

# JA-EN fairseq preprocess train and dummy valid datasets
bash nmt/expe1/src/jaen/preprocess_fairseq_jaen_novalid.sh

# EN-JA fairseq preprocess train and dummy valid datasets
bash nmt/expe1/src/enja/preprocess_fairseq_enja_novalid.sh

# JA-EN base model training
bash nmt/expe1/src/jaen/fairseq_nmt_pretrain_base_novalid_jaen.sh

# JA-EN big model training
bash nmt/expe1/src/jaen/fairseq_nmt_pretrain_big_novalid_jaen.sh

# EN-JA base model training
bash nmt/expe1/src/enja/fairseq_nmt_pretrain_base_novalid_enja.sh

# EN-JA big model training
bash nmt/expe1/src/enja/fairseq_nmt_pretrain_big_novalid_enja.sh

Evaluating NMT Models

The datasets used for evaluation in our experiments are listed below.

  • ASPEC, Asian Scientific Paper Excerpt Corpus
  • JESC, Japanese-English Subtitle Corpus containing casual language, colloquialisms, expository writing, and narrative discourse
  • KFTT, Kyoto Free Translation Task that focuses on Wikipedia articles related to Kyoto
  • IWSLT 2017 TED.tst2015 used in the IWSLT 2017 Evaluation Campaign, including TED talk transcripts in both languages
  • Duolingo STAPLE, from the 2020 Duolingo Shared Task on Simultaneous Translation And Paraphrase for Language Education
  • Tatoeba corpus, a large collection of multilingual sentences and translations that keeps being updated by voluntary contributors; release v20190709 is used in our experiment
  • BSD, Business Scene Dialogue corpus containing Japanese-English business conversations

To create the test split for each dataset, please take a look at this Jupyter notebook, and then preprocess the datasets in the same way as the training dataset.

# tokenize and length filter testing datasets
bash nmt/expe1/src/preprocess/preprocess_test_dataset.sh

# JA-EN fairseq preprocess test datasets
bash nmt/expe1/src/jaen/preprocess_fairseq_test_jaen.sh

# EN-JA fairseq preprocess test datasets
bash nmt/expe1/src/enja/preprocess_fairseq_test_enja.sh

Run the following scripts to evaluate the corresponding models on all 7 datasets. The results will be placed in the ./nmt/results/ folder.

# JA-EN base model evaluation
bash nmt/expe1/src/jaen/fairseq_nmt_generate_evaluate_jaen.sh

# JA-EN big model evaluation
bash nmt/expe1/src/jaen/fairseq_nmt_generate_evaluate_jaen_big.sh

# EN-JA base model evaluation
bash nmt/expe1/src/enja/fairseq_nmt_generate_evaluate_enja.sh

# EN-JA big model evaluation
bash nmt/expe1/src/enja/fairseq_nmt_generate_evaluate_enja_big.sh
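The evaluation scripts above take care of generation and scoring; for reference, the sketch below shows how BLEU could be computed independently from a hypothesis file and a reference file with sacrebleu. The file names and the Japanese tokenizer choice are assumptions, not the repository's actual scoring setup.

# bleu_sketch.py -- an independent BLEU check with sacrebleu, not the repository's scoring script.
# "hyp.ja" and "ref.ja" are hypothetical detokenized system output and reference files.
import sacrebleu

with open("hyp.ja", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("ref.ja", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# for Japanese output, use a Japanese-aware tokenizer (requires sacrebleu's optional MeCab dependency);
# for English output, the default tokenizer is fine
bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="ja-mecab")
print(f"BLEU = {bleu.score:.1f}")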

NMT Models Comparison

Information of the Corpora

Corpus              NTT-JParaCrawl   Laboro-ParaCorpus   Laboro-ParaCorpus+   Laboro-ParaCorpus-NTT
corpus size         2.4 G            1.6 G               1.6 G                4.0 G
# sentence pairs    8.8 M            14 M                14 M                 23 M
# tokens (EN / JA)  254 M / 228 M    163 M / 148 M       164 M / 149 M        420 M / 376 M

BLEU Scores of EN-JA Base Models

In this and the following tables, PT refers to the pre-trained models and FT to the models further fine-tuned on the training data of the corresponding evaluation dataset; "-" indicates that no fine-tuned model was evaluated on that dataset. Google Cloud Translate is an external baseline with a single score per dataset.

Dataset     NTT-JParaCrawl   Laboro-ParaCorpus   Laboro-ParaCorpus+   Laboro-ParaCorpus-NTT   Google Cloud Translate
            PT     FT        PT     FT           PT     FT            PT     FT
ASPEC       17.4   30.1      17.9   29.4         17.9   29.5          18.0   29.9             23.1
JESC        6.0    12.8      5.9    11.9         6.2    12.4          6.5    12.3             7.9
KFTT        14.7   27.8      14.2   27.6         14.2   27.7          15.0   28.8             14.9
IWSLT       11.3   13.9      10.5   13.6         10.2   14.0          11.3   13.9             14.8
Duolingo    47.8   -         42.0   -            41.2   -             46.9   -                55.4
Tatoeba     19.7   -         19.5   -            20.6   -             20.2   -                28.2
BSD         10.6   -         11.6   -            12.5   -             11.7   -                15.8

BLEU Scores of EN-JA Big Models

Dataset     NTT-JParaCrawl   Laboro-ParaCorpus   Laboro-ParaCorpus+   Laboro-ParaCorpus-NTT   Google Cloud Translate
            PT     FT        PT     FT           PT     FT            PT     FT
ASPEC       19.4   31.1      18.8   30.8         18.7   30.9          20.2   31.2             23.1
JESC        6.5    13.1      6.0    13.1         6.3    12.9          6.6    13.8             7.9
KFTT        15.7   29.6      15.9   29.5         15.9   29.1          17.3   29.7             14.9
IWSLT       12.1   13.5      11.0   13.8         11.0   14.2          12.3   14.2             14.8
Duolingo    47.6   -         41.4   -            40.0   -             44.5   -                55.4
Tatoeba     21.1   -         20.1   -            22.3   -             21.3   -                28.2
BSD         11.5   -         12.6   -            13.8   -             12.3   -                15.8

BLEU Scores of JA-EN Base Models

Dataset     NTT-JParaCrawl   Laboro-ParaCorpus   Laboro-ParaCorpus+   Laboro-ParaCorpus-NTT   Google Cloud Translate
            PT     FT        PT     FT           PT     FT            PT     FT
ASPEC       19.1   30.1      20.3   29.8         20.0   30.0          20.3   30.2             24.6
JESC        7.5    18.7      7.0    17.7         7.6    17.6          7.6    18.9             7.8
KFTT        15.0   26.8      14.5   26.4         14.6   26.4          15.6   26.7             19.1
IWSLT       11.7   18.1      12.0   18.0         12.2   18.1          12.1   18.1             13.1
Duolingo    41.7   -         39.7   -            38.5   -             41.6   -                42.2
Tatoeba     28.5   -         26.9   -            29.0   -             29.1   -                33.3
BSD         16.8   -         16.1   -            18.1   -             17.4   -                18.4

BLEU Scores of JA-EN Big Models

Dataset     NTT-JParaCrawl   Laboro-ParaCorpus   Laboro-ParaCorpus+   Laboro-ParaCorpus-NTT   Google Cloud Translate
            PT     FT        PT     FT           PT     FT            PT     FT
ASPEC       20.1   31.0      19.9   30.7         20.3   30.2          20.6   30.9             24.6
JESC        7.5    19.7      7.6    18.4         8.3    19.0          8.5    19.4             7.8
KFTT        16.2   27.2      15.4   26.5         16.1   26.8          17.5   27.5             19.1
IWSLT       12.8   17.7      13.0   18.6         13.3   18.8          13.6   19.0             13.1
Duolingo    42.2   -         40.9   -            40.1   -             42.8   -                42.2
Tatoeba     31.4   -         29.1   -            32.0   -             31.7   -                33.3
BSD         17.3   -         17.5   -            19.4   -             19.2   -                18.4