Multi-node SLURM training
Salvatore-tech opened this issue · 2 comments
Good morning,
I have read the documentation about fine-tuning and I'd like to launch train.py to fine-tune PWCNet on my dataset, loading a checkpoint file.
I have created a script to load the dataset under /configs/base/datasets and use it in the config script under /configs/pwcnet.
I'd like to reduce the training time by using all the resources I have available on a small cluster with 4 nodes, each with 4 NVIDIA Tesla V100 32GB SXM2 GPUs (NVLink).
Do you see any room for improvement with the following command?
srun -p xgpu --job-name=pwc_kitti --gres=gpu:4 --ntasks=16 --ntasks-per-node=4 --cpus-per-task=2 --kill-on-bad-exit=1 python -u tools/train.py $MMFLOW/configs/pwcnet/pwcnet_ft_4x1_300k_kitti_320x896.py --work-dir=$MMFLOW/work_dir/pwckitti --launcher=slurm
The estimated time to complete fine-tuning is about 1 day. Do you think I'm using all 4 nodes correctly?
If so, can I reduce the number of training iterations to shorten it?
Thanks in advance!
Your command can use all your resources.
If you want to reduce the training time, you can reduce the number of iterations. The config pwcnet_ft_4x1_300k_kitti_320x896.py follows the settings of the original paper, which uses a total batch size of 4. With your command the total batch size is increased to 16, so there is no need to train for as many iterations as the original paper.
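As a rough sketch of the arithmetic above (the values 4, 300k, and the 4x4x1 layout come from the config name and the srun flags in this thread; keeping the total number of samples seen constant is an assumption, not a guarantee of equal accuracy):

```python
# Hypothetical helper: rescale the iteration budget when the global batch
# size changes, keeping the number of samples seen roughly constant.
# From the thread: the "4x1_300k" config means batch size 4 for 300k
# iterations; srun with 4 nodes x 4 GPUs x 1 sample/GPU gives batch size 16.

def scaled_iters(base_iters: int, base_batch: int, new_batch: int) -> int:
    """Keep iters * batch roughly constant when the batch size changes."""
    return base_iters * base_batch // new_batch

nodes, gpus_per_node, samples_per_gpu = 4, 4, 1
global_batch = nodes * gpus_per_node * samples_per_gpu
print(global_batch)                             # 16
print(scaled_iters(300_000, 4, global_batch))   # 75000
```

In an OpenMMLab-style config this would mean lowering `runner.max_iters` (and the learning-rate schedule steps) accordingly. Treat the rescaled number as a starting point and validate on your data, since training for fewer iterations can still cost accuracy.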
Besides, in my experience with SGD, when the batch size is increased by a factor of 4, the learning rate should also be increased by a factor of 4; this change can help speed up convergence. But the optical flow task uses Adam as the optimizer, so I'm not sure whether this strategy still works well. You can give it a try.
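The linear scaling rule mentioned above can be sketched like this. Note that `base_lr` below is a made-up placeholder, not the actual value in the config, and whether linear scaling transfers from SGD to Adam is exactly the open question here:

```python
# Hypothetical linear LR scaling: the learning rate grows proportionally
# with the global batch size. Check the optimizer section of the config
# for the real base value before applying this.

def linear_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Scale the learning rate linearly with the batch size."""
    return base_lr * new_batch / base_batch

base_lr = 1e-4                            # placeholder value only
new_lr = linear_scaled_lr(base_lr, 4, 16)
print(new_lr)                             # 0.0004
```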
Thanks @Zachary-66 for your answer, I'm closing this issue.