Multi-node training does not work
zkcys001 opened this issue · 1 comment
Thanks for your good work!
I have some questions about multi-node training.
Specifically, I tried your script (mpiexec -n 16 or mpirun) on 2 nodes with 16 GPUs for ImageNet, but an NCCL error still occurs.
Script:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:8
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
mpirun python scripts/image_train.py --data_dir /input/datasets/imagenet/train.zip --attention_resolutions 32,16,8 --class_cond True \
--diffusion_steps 1000 --image_size 128 --learn_sigma True --noise_schedule linear --num_channels 256 --num_heads 4 \
--num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --lr 1e-4 --batch_size 8 --logger_dir '/input/guide_diffusion/image128con'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:8
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
mpiexec -n 16 python scripts/image_train.py --data_dir /input/datasets/imagenet/train.zip --attention_resolutions 32,16,8 --class_cond True \
--diffusion_steps 1000 --image_size 128 --learn_sigma True --noise_schedule linear --num_channels 256 --num_heads 4 \
--num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --lr 1e-4 --batch_size 8 --logger_dir '/input/guide_diffusion/image128con'
Error:
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Hi @zkcys001, have you changed the parameter GPUS_PER_NODE = 8 in the script dist_util.py?
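For context, here is a simplified sketch of how that constant is typically used (paraphrased from guided-diffusion's dist_util.setup_dist(); the actual file may differ): each MPI rank pins itself to local GPU rank % GPUS_PER_NODE, so if the constant does not match the real number of GPUs per node, two ranks can land on the same device and NCCL initialization fails with exactly this kind of "invalid usage" error.

import os
import socket

import torch
import torch.distributed as dist
from mpi4py import MPI

GPUS_PER_NODE = 8  # must match the actual number of GPUs on each node


def setup_dist():
    # Pin each MPI rank to one local GPU. If GPUS_PER_NODE is wrong,
    # two ranks can share a device, which NCCL rejects at init time.
    if dist.is_initialized():
        return
    comm = MPI.COMM_WORLD
    os.environ["CUDA_VISIBLE_DEVICES"] = str(comm.Get_rank() % GPUS_PER_NODE)

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    hostname = socket.gethostbyname(socket.getfqdn())
    os.environ["MASTER_ADDR"] = comm.bcast(hostname, root=0)
    os.environ["RANK"] = str(comm.rank)
    os.environ["WORLD_SIZE"] = str(comm.size)
    os.environ["MASTER_PORT"] = str(comm.bcast(29500, root=0))  # any free port, broadcast from rank 0
    dist.init_process_group(backend=backend, init_method="env://")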
If so, please try the guided-diffusion repo for 2-node, 16-GPU training and see whether you hit the same NCCL issue.
If you do, please also search the guided-diffusion repo for a similar issue and solution.
For example, this discussion might be helpful for your case: guided-diffusion issue #22.
I think NCCL problems like this are usually caused by the GPU cluster setup and the NCCL version.
Personally, I also had some issues with multi-node training when I used their code.
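If the problem persists, a minimal first debugging step (a sketch, assuming PyTorch built with CUDA; the interface name below is a hypothetical example) is to confirm which NCCL version PyTorch was compiled against and to turn on NCCL's own logging before any process group is created:

import os

import torch

# Report the PyTorch build and the NCCL version it was compiled against.
print("torch:", torch.__version__)
print("nccl:", torch.cuda.nccl.version())

# Ask NCCL for verbose logs; the output usually names the rank/device
# that triggered the "invalid usage" failure during init.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT"
# os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # hypothetical; set to your cluster's network interface

In practice these variables would be exported in the batch script (or passed with mpirun -x NCCL_DEBUG=INFO) so that every rank picks them up.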