Fairseq scripts for training NMT models with various architectures and tokenization methods (BPE, SentencePiece)
- Prepare a parallel corpus (raw data)
- Write data parsing and cleansing code in "make_nmt_data".
- Clone the submodules (fairseq, subword-nmt)
git submodule init
git submodule update
- Install sentencepiece
pip install sentencepiece
- Execute the scripts below, in order (BPE pipeline)
bash 1_prepare_nmt_data.sh [raw_data] [train_data_path]
bash 2_tokenization_bpe.sh [train_data_path] [number of merge operations]
bash 3_fairseq_preprocess.sh [train_data_path] --bpe
bash 4_fairseq_train.sh [train_data_path] transformer --bpe
bash 5_fairseq_generate.sh [train_data_path] transformer --bpe
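The BPE tokenization step (presumably backed by the subword-nmt submodule cloned above) learns a fixed number of merge operations from the training corpus. Below is a minimal, illustrative sketch of that merge-learning loop in pure Python; it is not the actual subword-nmt implementation, and the `learn_bpe` function and the toy word counts are made up for demonstration:

```python
# Minimal sketch of BPE merge learning (illustrative only; the real
# 2_tokenization_bpe.sh presumably wraps the subword-nmt scripts).
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` BPE merge operations from a word-frequency dict."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = {tuple(w) + ('</w>',): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge the most frequent pair everywhere it occurs.
        merged = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = count
        vocab = merged
    return merges

merges = learn_bpe({'low': 5, 'lower': 2, 'newest': 6, 'widest': 3}, 10)
```

The `[number of merge operations]` argument of `2_tokenization_bpe.sh` corresponds to `num_merges` here: more merges yield a larger subword vocabulary with longer units.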
- Or, with SentencePiece instead of BPE (SentencePiece pipeline)
bash 1_prepare_nmt_data.sh [raw_data] [train_data_path]
bash 2_tokenization_sentencepiece.sh [train_data_path] [vocab_size]
bash 3_fairseq_preprocess.sh [train_data_path] --sp [vocab_size]
bash 4_fairseq_train.sh [train_data_path] transformer --sp [vocab_size]
bash 5_fairseq_generate.sh [train_data_path] transformer --sp [vocab_size]