Here, I use a virtual environment, but Conda should be very similar.
- Create a virtual environment with Python 3
python3 -m venv [PATH]
- Activate the environment
source [PATH]/bin/activate
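To confirm the environment is active, a quick check such as the following should point at the venv's interpreter:
which python3
python3 --version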
- Clone the code
git clone https://github.com/rasoolims/zero-shot-mt
cd zero-shot-mt/src
- Install requirements. In my experiments, I used CUDA 10.1 and cuDNN 7. To replicate the results, please use those versions. If things do not work as expected, use the Docker installation instead.
python3 -m pip install --upgrade pip
pip install -r requirements.txt
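After installing, you can sanity-check that PyTorch sees your GPU (assuming torch is among the requirements):
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"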
It might be the case that pyicu does not install properly. If so, follow its instructions to install it.
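On Ubuntu, for example, pyicu usually needs the ICU development headers and pkg-config before pip can build it (package names vary by distribution):
sudo apt-get install pkg-config libicu-dev
python3 -m pip install pyicu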
Assuming that Docker and NVIDIA Docker are installed, follow these steps:
- Download the repository and pretrained models
git clone https://github.com/rasoolims/zero-shot-mt
- Build the Docker image from the command line:
docker build dockers/gpu/ -t [docker-name] --no-cache
- Start the Docker container:
- Run this inside screen since training might take a long time (see the example after the command).
docker run --gpus all -it [docker-name]
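For example, with screen you can start a named session, run the container inside it, and detach with Ctrl-a d while training continues:
screen -S mt-training
docker run --gpus all -it [docker-name]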
Throughout this guide, I use the small files in the sample folder. Here, the Persian and English files are parallel, but the Arabic text is not!
WARNING: Depending on the data, the best parameters might differ significantly. It is worth doing some parameter tuning to find the best setting.
python3 scripts/icu_transliterate.py sample/fa.txt sample/fa.tr.txt
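In case you are curious what this step does: roughly, it romanizes the text with ICU transforms. A minimal PyICU sketch of the idea (the exact rules in scripts/icu_transliterate.py may differ):
from icu import Transliterator

# Romanize text from any script into Latin; the repository script may use
# different or additional transform rules.
tr = Transliterator.createInstance("Any-Latin")
with open("sample/fa.txt") as fin, open("sample/fa.tr.txt", "w") as fout:
    for line in fin:
        fout.write(tr.transliterate(line.strip()) + "\n")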
Then, we concatenate the files. Note that this could be any number of files covering two or more languages.
cat sample/en.txt sample/fa.tr.txt > sample/all.txt
Now we are ready to train a tokenizer:
python train_tokenizer.py --data sample/all.txt --vocab [vocab-size] --model sample/tok
The vocab size could be any value; anything between 30,000 and 100,000 should be good. For this sample file, try 1000.
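For the sample data, that would be:
python train_tokenizer.py --data sample/all.txt --vocab 1000 --model sample/tok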
Parallel data could be gold-standard or mined. You should load pre-trained MASS models for the best performance.
1. Create binary files for the training and dev datasets: For simplicity, we use the Persian and English text files as both training and development datasets, using their last 100 sentences as development data.
head -9900 sample/fa.txt > sample/train.fa
head -9900 sample/en.txt > sample/train.en
tail -100 sample/en.txt > sample/dev.en
tail -100 sample/fa.txt > sample/dev.fa
head -9900 sample/fa.tr.txt > sample/train.tr.fa
tail -100 sample/fa.tr.txt > sample/dev.tr.fa
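A quick sanity check that the splits line up (parallel files must have the same number of lines):
wc -l sample/train.fa sample/train.en sample/train.tr.fa
wc -l sample/dev.fa sample/dev.en sample/dev.tr.fa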
python create_mt_batches.py --tok sample/tok/ --src sample/train.fa \
--dst sample/train.en --srct sample/train.tr.fa \
--output sample/fa2en.train.mt
python create_mt_batches.py --tok sample/tok/ --src sample/dev.fa \
--dst sample/dev.en --srct sample/dev.tr.fa \
--output sample/fa2en.dev.mt
If you create translation data in multiple directions, you can train a multilingual translation model that learns from all of those directions. Multiple data files can be separated by commas (,) in both the --train and --dev arguments, as in the example below.
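For example, assuming you had also built hypothetical Arabic-to-English binaries, the options could look like:
--train sample/fa2en.train.mt,sample/ar2en.train.mt \
--dev sample/fa2en.dev.mt,sample/ar2en.dev.mt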
2. Train machine translation:
CUDA_VISIBLE_DEVICES=0 python3 -u train_mt.py --tok sample/tok/ \
--model sample/mt_model --train sample/fa2en.train.mt \
--capacity 600 --batch 4000 --beam 4 --step 500000 --warmup 4000 \
--lr 0.0001 --dev sample/fa2en.dev.mt \
--dropout 0.1 --multi
After you are done, you can use the model path sample/mt_model for translating text to English (similar to the section on using the pretrained models in our paper).
3. Translate:
CUDA_VISIBLE_DEVICES=0 python -u translate.py --tok sample/tok/ \
--model sample/mt_model --input sample/dev.fa --input2 sample/dev.tr.fa \
--output sample/dev.output.en
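If you want a rough quality estimate against the reference, an external scorer such as sacrebleu (not part of this repository) can compare the output with sample/dev.en:
pip install sacrebleu
sacrebleu sample/dev.en -i sample/dev.output.en -m bleu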
Note that there is a --verbose option that writes the input and output lines separated by |||. This is especially useful for back-translation (to make sure sentence alignments are fully guaranteed) or for annotation projection, where you might need it for word alignment.
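For instance, assuming each verbose line has the form input ||| output, a small Python snippet can split it back into two aligned files (file names here are just for illustration):
# Split "input ||| output" lines into two aligned files.
with open("sample/dev.output.en") as fin, \
        open("sample/bt.src", "w") as src, \
        open("sample/bt.tgt", "w") as tgt:
    for line in fin:
        parts = line.split("|||")
        if len(parts) == 2:
            src.write(parts[0].strip() + "\n")
            tgt.write(parts[1].strip() + "\n")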