facebookresearch/CodeGen

Training MLM


I have followed the preprocessing steps for both monolingual and monolingual_functions. The generated files in XLM-syml are as follows:

  • train.[cpp | java | python]_cl.[0..NGPU].pth
  • train.[cpp | java | python]_monolingual.[0..NGPU].pth
  • train.[cpp | java | python]_sa.[0..NGPU].pth
  • test.[cpp | java | python]_cl.pth
  • test.[cpp | java | python]_monolingual.pth
  • test.[cpp | java | python]_sa.pth
  • valid.[cpp | java | python]_cl.pth
  • valid.[cpp | java | python]_monolingual.pth
  • valid.[cpp | java | python]_sa.pth

However, whenever I start training using the script from the README (copied below), I get the file-not-found error shown below. It seems that the script is looking for different files.

Error

File "/CodeGen/codegen_sources/model/train.py", line 697, in <module> check_data_params(params) File "/CodeGen/codegen_sources/model/src/data/loader.py", line 470, in check_data_params assert all( AssertionError: [['/.../transcoder_data/train_data_small/XLM-syml/train.java.pth', '/.../transcoder_data/train_data_small/XLM-syml/valid.java.pth', '/.../transcoder_data/train_data_small/XLM-syml/test.java.pth'], ['/.../transcoder_data/train_data_small/XLM-syml/train.python.pth', '/.../transcoder_data/train_data_small/XLM-syml/valid.python.pth', '/.../transcoder_data/train_data_small/XLM-syml/test.python.pth']]

Training Scripts

python3 -m torch.distributed.launch --nproc_per_node=$NGPU codegen_sources/model/train.py \
--exp_name mlm \
--dump_path '/.../transcoder_data/train_data_small_dump' \
--data_path '/.../transcoder_data/train_data_small/XLM-syml' \
--mlm_steps 'java,python' \
--add_eof_to_stream true \
--word_mask_keep_rand '0.8,0.1,0.1' \
--word_pred '0.15' \
--encoder_only true \
--n_layers 12  \
--emb_dim 768  \
--n_heads 12  \
--lgs 'java-python' \
--max_vocab 64000 \
--gelu_activation true \
--roberta_mode true \
--amp 2  \
--fp16 true  \
--batch_size 32 \
--bptt 512 \
--epoch_size 100000 \
--max_epoch 100000 \
--split_data_accross_gpu global \
--optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' \
--save_periodic 0 \
--validation_metrics _valid_mlm_ppl \
--stopping_criterion '_valid_mlm_ppl,10'

It seems to me that both monolingual and monolingual_functions add a suffix to the training file names that is not expected elsewhere in the training pipeline, or perhaps I am missing some flags.
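
For instance, with --lgs 'java-python' the loader appears to build the expected file names directly from the language names, so it looks for unsuffixed files that the monolingual preprocessing never produces (paths abbreviated as in the listing above):

ls XLM-syml/train.java_monolingual.0.pth   # produced by the monolingual preprocessing
ls XLM-syml/train.java.pth                 # asserted on by check_data_params -> No such file or directory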

Thanks

Hi.
We tried to keep the data preprocessing and training pipelines independent, so preprocessing does not set any flags for training; everything is configured through the training parameters.
If you want to train your MLM on the monolingual dataset (which is what we did for TransCoder), you need either to rename the files or create symlinks so that you have train.[cpp | java | python].[0..NGPU].pth files with the content of the monolingual .pth files (see the sketch after the flags below), or to set:

--lgs 'java_monolingual-python_monolingual' \
--mlm_steps 'java_monolingual,python_monolingual' \
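
For example, a rough sketch of the symlink option, assuming the XLM-syml layout listed above and the same NGPU used for preprocessing (the loop and shard count are illustrative; adjust them to your setup):

cd /.../transcoder_data/train_data_small/XLM-syml
for lang in java python; do
  # per-shard training files, expected as train.<lang>.<shard>.pth
  for i in $(seq 0 $((NGPU - 1))); do
    ln -s train.${lang}_monolingual.${i}.pth train.${lang}.${i}.pth
  done
  # single valid/test files, expected as valid.<lang>.pth and test.<lang>.pth
  for split in valid test; do
    ln -s ${split}.${lang}_monolingual.pth ${split}.${lang}.pth
  done
done

Either way, the rest of the training command stays the same.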

Thanks, Baptiste.