Training MLM
I have followed the preprocessing step for both `monolingual` and `monolingual_functions`. The generated files in `XLM-syml` are as follows:
- train.[cpp | java | python]_cl.[0..NGPU].pth
- train.[cpp | java | python]_monolingual.[0..NGPU].pth
- train.[cpp | java | python]_sa.[0..NGPU].pth
- test.[cpp | java | python]_cl.pth
- test.[cpp | java | python]_monolingual.pth
- test.[cpp | java | python]_sa.pth
- valid.[cpp | java | python]_cl.pth
- valid.[cpp | java | python]_monolingual.pth
- valid.[cpp | java | python]_sa.pth
However, whenever I start training with the script from the README (copied below), I get the following file-not-found error. It seems to me that the script is looking for different files.
Error
```
File "/CodeGen/codegen_sources/model/train.py", line 697, in <module>
    check_data_params(params)
File "/CodeGen/codegen_sources/model/src/data/loader.py", line 470, in check_data_params
    assert all(
AssertionError: [['/.../transcoder_data/train_data_small/XLM-syml/train.java.pth',
  '/.../transcoder_data/train_data_small/XLM-syml/valid.java.pth',
  '/.../transcoder_data/train_data_small/XLM-syml/test.java.pth'],
 ['/.../transcoder_data/train_data_small/XLM-syml/train.python.pth',
  '/.../transcoder_data/train_data_small/XLM-syml/valid.python.pth',
  '/.../transcoder_data/train_data_small/XLM-syml/test.python.pth']]
```
Training Scripts
```bash
python3 -m torch.distributed.launch --nproc_per_node=$NGPU codegen_sources/model/train.py \
--exp_name mlm \
--dump_path '/.../transcoder_data/train_data_small_dump' \
--data_path '/.../transcoder_data/train_data_small/XLM-syml' \
--mlm_steps 'java,python' \
--add_eof_to_stream true \
--word_mask_keep_rand '0.8,0.1,0.1' \
--word_pred '0.15' \
--encoder_only true \
--n_layers 12 \
--emb_dim 768 \
--n_heads 12 \
--lgs 'java-python' \
--max_vocab 64000 \
--gelu_activation true \
--roberta_mode true \
--amp 2 \
--fp16 true \
--batch_size 32 \
--bptt 512 \
--epoch_size 100000 \
--max_epoch 100000 \
--split_data_accross_gpu global \
--optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' \
--save_periodic 0 \
--validation_metrics _valid_mlm_ppl \
--stopping_criterion '_valid_mlm_ppl,10'
```
It seems to me that both `monolingual` and `monolingual_functions` add a suffix to the training files that is not expected elsewhere in the training pipeline, or maybe I am missing some flags.
Thanks
Hi.
We tried to make the data preprocessing and training pipelines independent, so we don't set flags outside of the training parameters.
If you want to train your MLM on the monolingual dataset (what we did for TransCoder), you need either to rename/create symlinks so that you have train.[cpp | java | python].[0..NGPU].pth files with the content of the monolingual .pth files (a sketch follows below), or to set:
```
--lgs 'java_monolingual-python_monolingual' \
--mlm_steps 'java_monolingual,python_monolingual' \
```
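For the symlink option, something like this should work (a rough, untested sketch: `DATA_PATH` and `NGPU` are placeholders for your own values, and shards are assumed to be indexed `0..NGPU-1`, one per GPU):
```bash
DATA_PATH='/.../transcoder_data/train_data_small/XLM-syml'
NGPU=8  # number of shards produced by preprocessing

for lang in cpp java python; do
  # Sharded training files: train.<lang>_monolingual.<i>.pth -> train.<lang>.<i>.pth
  for i in $(seq 0 $((NGPU - 1))); do
    ln -sf "$DATA_PATH/train.${lang}_monolingual.${i}.pth" \
           "$DATA_PATH/train.${lang}.${i}.pth"
  done
  # Unsharded valid/test files
  for split in valid test; do
    ln -sf "$DATA_PATH/${split}.${lang}_monolingual.pth" \
           "$DATA_PATH/${split}.${lang}.pth"
  done
done
```
With the symlinks in place, the original `--lgs 'java-python'` / `--mlm_steps 'java,python'` flags can stay as they are.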
Thanks, Baptiste.