facebookresearch/muss

train model failed

akafen opened this issue · 11 comments

I changed cluster "local" to "debug" in scripts/train_model.py and ran the command "python3 scripts/train_models.py", but it fails.
The error:

fairseq-train /home/liuyijiao/muss/resources/datasets/_d41b33752d58c3fa688aef596b98df2b/fairseq_preprocessed_complex-simple --task translation --source-lang complex --target-lang simple --save-dir /home/liuyijiao/muss/experiments/fairseq/slurmjob_DEBUG_139908269653632/checkpoints --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 2500 --update-freq 16 --arch mbart_large --dropout 0.3 --weight-decay 0.0 --clip-norm 0.1 --share-all-embeddings --no-epoch-checkpoints --save-interval 999999 --validate-interval 999999 --max-update 50000 --save-interval-updates 100 --keep-interval-updates 1 --patience 10 --max-sentences 64 --seed 708 --distributed-world-size 8 --distributed-port 11733 --fp16 --restore-file '/home/liuyijiao/muss/resources/models/mbart/model.pt' --task 'translation_from_pretrained_bart' --source-lang 'complex' --target-lang 'simple' --encoder-normalize-before --decoder-normalize-before --label-smoothing 0.2 --dataset-impl 'mmap' --optimizer 'adam' --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' --min-lr -1 --total-num-update 40000 --attention-dropout 0.1 --weight-decay 0.0 --max-tokens 1024 --update-freq 2 --log-format 'simple' --log-interval 2 --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler --langs 'ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN' --layernorm-embedding --ddp-backend 'no_c10d'
usage: train_models.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED]
[--cpu] [--tpu] [--bf16] [--memory-efficient-bf16]
[--fp16] [--memory-efficient-fp16]
[--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ]
[--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE]
[--checkpoint-suffix CHECKPOINT_SUFFIX]
[--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--quantization-config-path QUANTIZATION_CONFIG_PATH]
[--profile]
[--criterion {sentence_ranking,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,sentence_prediction,cross_entropy,ctc,legacy_masked_lm_loss,masked_lm,adaptive_loss,nat_loss,composite_loss,wav2vec,vocab_parallel_cross_entropy}]
[--tokenizer {nltk,moses,space}]
[--bpe {byte_bpe,subword_nmt,sentencepiece,gpt2,characters,bert,hf_byte_bpe,bytes,fastbpe}]
[--optimizer {sgd,adagrad,nag,adadelta,lamb,adafactor,adamax,adam}]
[--lr-scheduler {inverse_sqrt,tri_stage,reduce_lr_on_plateau,triangular,polynomial_decay,cosine,fixed}]
[--scoring {sacrebleu,bleu,wer,chrf}] [--task TASK]
[--num-workers NUM_WORKERS]
[--skip-invalid-size-inputs-valid-test]
[--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
[--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
[--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
[--dataset-impl {raw,lazy,cached,mmap,fasta}]
[--data-buffer-size DATA_BUFFER_SIZE]
[--train-subset TRAIN_SUBSET]
[--valid-subset VALID_SUBSET]
[--validate-interval VALIDATE_INTERVAL]
[--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
[--validate-after-updates VALIDATE_AFTER_UPDATES]
[--fixed-validation-seed FIXED_VALIDATION_SEED]
[--disable-validation]
[--max-tokens-valid MAX_TOKENS_VALID]
[--batch-size-valid BATCH_SIZE_VALID]
[--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
[--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
[--distributed-world-size DISTRIBUTED_WORLD_SIZE]
[--distributed-rank DISTRIBUTED_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT]
[--device-id DEVICE_ID] [--distributed-no-spawn]
[--ddp-backend {c10d,no_c10d}]
[--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
[--find-unused-parameters] [--fast-stat-sync]
[--broadcast-buffers]
[--distributed-wrapper {DDP,SlowMo}]
[--slowmo-momentum SLOWMO_MOMENTUM]
[--slowmo-algorithm SLOWMO_ALGORITHM]
[--localsgd-frequency LOCALSGD_FREQUENCY]
[--nprocs-per-node NPROCS_PER_NODE]
[--pipeline-model-parallel]
[--pipeline-balance PIPELINE_BALANCE]
[--pipeline-devices PIPELINE_DEVICES]
[--pipeline-chunks PIPELINE_CHUNKS]
[--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
[--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
[--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
[--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
[--pipeline-checkpoint {always,never,except_last}]
[--zero-sharding {none,os}] [--arch ARCH]
[--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE]
[--stop-time-hours STOP_TIME_HOURS]
[--clip-norm CLIP_NORM] [--sentence-avg]
[--update-freq UPDATE_FREQ] [--lr LR] [--min-lr MIN_LR]
[--use-bmuf] [--save-dir SAVE_DIR]
[--restore-file RESTORE_FILE]
[--finetune-from-model FINETUNE_FROM_MODEL]
[--reset-dataloader] [--reset-lr-scheduler]
[--reset-meters] [--reset-optimizer]
[--optimizer-overrides OPTIMIZER_OVERRIDES]
[--save-interval SAVE_INTERVAL]
[--save-interval-updates SAVE_INTERVAL_UPDATES]
[--keep-interval-updates KEEP_INTERVAL_UPDATES]
[--keep-last-epochs KEEP_LAST_EPOCHS]
[--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
[--no-save] [--no-epoch-checkpoints]
[--no-last-checkpoints] [--no-save-optimizer-state]
[--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
[--maximize-best-checkpoint-metric]
[--patience PATIENCE]
[--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
[--dropout D] [--attention-dropout D]
[--activation-dropout D] [--encoder-embed-path STR]
[--encoder-embed-dim N] [--encoder-ffn-embed-dim N]
[--encoder-layers N] [--encoder-attention-heads N]
[--encoder-normalize-before] [--encoder-learned-pos]
[--decoder-embed-path STR] [--decoder-embed-dim N]
[--decoder-ffn-embed-dim N] [--decoder-layers N]
[--decoder-attention-heads N] [--decoder-learned-pos]
[--decoder-normalize-before] [--decoder-output-dim N]
[--share-decoder-input-output-embed]
[--share-all-embeddings]
[--no-token-positional-embeddings]
[--adaptive-softmax-cutoff EXPR]
[--adaptive-softmax-dropout D] [--layernorm-embedding]
[--no-scale-embedding] [--no-cross-attention]
[--cross-self-attention] [--encoder-layerdrop D]
[--decoder-layerdrop D]
[--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP]
[--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP]
[--quant-noise-pq D] [--quant-noise-pq-block-size D]
[--quant-noise-scalar D] [--pooler-dropout D]
[--pooler-activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
[--spectral-norm-classification-head]
[--label-smoothing D] [--report-accuracy]
[--ignore-prefix-size IGNORE_PREFIX_SIZE]
[--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS]
[--weight-decay WEIGHT_DECAY] [--use-old-adam]
[--force-anneal N] [--warmup-updates N]
[--end-learning-rate END_LEARNING_RATE] [--power POWER]
[--total-num-update TOTAL_NUM_UPDATE] [-s SRC]
[-t TARGET] [--load-alignments]
[--left-pad-source BOOL] [--left-pad-target BOOL]
[--max-source-positions N] [--max-target-positions N]
[--upsample-primary UPSAMPLE_PRIMARY]
[--truncate-source] [--num-batch-buckets N]
[--eval-bleu] [--eval-bleu-detok EVAL_BLEU_DETOK]
[--eval-bleu-detok-args JSON] [--eval-tokenized-bleu]
[--eval-bleu-remove-bpe [EVAL_BLEU_REMOVE_BPE]]
[--eval-bleu-args JSON] [--eval-bleu-print-samples]
--langs LANG [--prepend-bos]
data
train_models.py: error: unrecognized arguments: --max-sentences 64
fairseq_prepare_and_train failed after 0.87s.
fairseq_train_and_evaluate_with_parametrization failed after 0.87s.

The code:

for exp_name, kwargs in tqdm(kwargs_dict.items()):
    executor = get_executor(
        cluster='debug',
        slurm_partition='priority',
        submit_decorators=[print_function_name, print_args, print_job_id, print_result, print_running_time],
        timeout_min=2 * 24 * 60,
        slurm_comment='EMNLP Arxiv deadline May 1st',
        gpus_per_node=kwargs['train_kwargs']['ngpus'],
        nodes=1,
        slurm_constraint='volta32gb',
        name=exp_name,
    )
    for i in range(5):
        job = executor.submit(fairseq_train_and_evaluate_with_parametrization, **kwargs)
        jobs_dict[exp_name].append(job)
[job.result() for jobs in jobs_dict.values() for job in jobs]
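
For context, the get_executor arguments here mirror submitit's, and cluster='debug' is meant to run each submitted job in the current process instead of going through Slurm, so exceptions surface directly. A minimal sketch with plain submitit (the log folder and the toy function are made up for illustration; this is not the muss wrapper itself):

import submitit

def add(a, b):
    return a + b

# "submitit_logs" is just a throwaway folder for this example's logs/pickles.
executor = submitit.AutoExecutor(folder="submitit_logs", cluster="debug")
job = executor.submit(add, 1, 2)
print(job.result())  # runs in-process and prints 3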

When the cluster is "local", training fails too.

Hi @akafen,
Thanks for pointing out this issue. I'm currently fixing the bugs in model training and I hope I can give you updated code soon.

@akafen Can you specify the infrastructure specs you are trying to run the setup on? Like the GPUs and the memory.

@NomadXD Of course I can specify the infrastructure specs I am trying to run the setup on. I am using one GPU to run the code:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0 Off |                  N/A |
| 19%   35C    P0    60W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

But I think the error is not about the infrastructure specs. The error is that --max-sentences is not an argument of fairseq-train. From the error output:

train_models.py: error: unrecognized arguments: --max-sentences 64

"Max sentences" is an unrecognized arguments

Yes, it's due to a problem with the fairseq version. I'm trying to find a solution to that right now :)

Basically, the --max-sentences argument was removed in this PR (fairseq>=0.9.0, which we install using pip) and was only added back later for backward compatibility in this PR (fairseq>=1.0.0a0, not yet on pip).
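
One way to avoid guessing from version strings is to ask the installed fairseq which flag its training parser actually exposes. A hedged sketch (not muss code; it pokes at argparse internals, so treat it as a quick diagnostic only):

from fairseq import options

# Build fairseq's training argument parser and collect every flag it knows.
parser = options.get_training_parser()
known_flags = {flag for action in parser._actions for flag in action.option_strings}

# Pick whichever batching flag this fairseq version accepts.
batch_flag = "--max-sentences" if "--max-sentences" in known_flags else "--batch-size"
print(batch_flag)  # prints "--batch-size" on the pip releases discussed above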

@louismartin I am using fairseq==0.10.2, so should I remove the --max-sentences argument and change it to batch_size?

Yes, that is the solution that I just pushed. I also made the training script simpler if you want to train a single model; maybe that can help as well.
Tell me if that works well on your end and we can close the issue.
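
For readers hitting the same error, a rough illustration of the change (not the actual commit): keep the command from the traceback above and only swap the batching flag, since fairseq==0.10.2 exposes --batch-size instead of --max-sentences.

# Excerpt of the arguments from the failing command above, patched so that
# fairseq==0.10.2 accepts them; only the flag name changes, the value stays 64.
train_args = ["--max-sentences", "64", "--seed", "708", "--fp16"]
train_args = ["--batch-size" if arg == "--max-sentences" else arg for arg in train_args]
print(train_args)  # ['--batch-size', '64', '--seed', '708', '--fp16']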

@louismartin Any plans on porting this to pure PyTorch? If ported to PyTorch, maybe it would be more accessible to people who don't have knowledge of fairseq?

Hi @Atharva-Phatak ,

Thanks for the message. fairseq uses PyTorch; there is no plan to use something else.

@akafen did it solve your issue?

I'm closing the issue but feel free to open a new issue if you have further questions or problems.