Multi-node training does not work
zkcys001 opened this issue · 1 comment
Thanks for your good work!
I have some questions about multi-node training.
Specifically, I tried your script (mpiexec -n 16 or mpirun) on 2 nodes with 16 GPUs for ImageNet, but an NCCL error still occurs.
Script:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:8
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
mpirun python scripts/image_train.py --data_dir /input/datasets/imagenet/train.zip --attention_resolutions 32,16,8 --class_cond True \
--diffusion_steps 1000 --image_size 128 --learn_sigma True --noise_schedule linear --num_channels 256 --num_heads 4 \
--num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --lr 1e-4 --batch_size 8 --logger_dir '/input/guide_diffusion/image128con'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:8
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
mpiexec -n 16 python scripts/image_train.py --data_dir /input/datasets/imagenet/train.zip --attention_resolutions 32,16,8 --class_cond True \
--diffusion_steps 1000 --image_size 128 --learn_sigma True --noise_schedule linear --num_channels 256 --num_heads 4 \
--num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --lr 1e-4 --batch_size 8 --logger_dir '/input/guide_diffusion/image128con'
Error:
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Hi @zkcys001, have you changed the parameter GPUS_PER_NODE = 8 in the script dist_util.py?
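For context, here is a simplified sketch of how that constant is typically used (paraphrased from guided-diffusion's dist_util.setup_dist(); the actual file may differ): each MPI rank pins itself to local GPU rank % GPUS_PER_NODE, so if the constant does not match the real number of GPUs per node, two ranks can land on the same device and NCCL initialization fails with exactly this kind of "invalid usage" error.

import os
import socket

import torch
import torch.distributed as dist
from mpi4py import MPI

GPUS_PER_NODE = 8  # must match the actual number of GPUs on each node


def setup_dist():
    # Pin each MPI rank to one local GPU. If GPUS_PER_NODE is wrong,
    # two ranks can share a device, which NCCL rejects at init time.
    if dist.is_initialized():
        return
    comm = MPI.COMM_WORLD
    os.environ["CUDA_VISIBLE_DEVICES"] = str(comm.Get_rank() % GPUS_PER_NODE)

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    hostname = socket.gethostbyname(socket.getfqdn())
    os.environ["MASTER_ADDR"] = comm.bcast(hostname, root=0)
    os.environ["RANK"] = str(comm.rank)
    os.environ["WORLD_SIZE"] = str(comm.size)
    os.environ["MASTER_PORT"] = str(comm.bcast(29500, root=0))  # any free port, broadcast from rank 0
    dist.init_process_group(backend=backend, init_method="env://")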
If so, please try the guided-diffusion repo for 2-node, 16-GPU training and see whether you hit the same NCCL issue.
If you do, please also search the guided-diffusion repo for a similar issue and solution.
For example, this discussion might be helpful for your case: guided-diffusion issue #22.
I think NCCL problems like this are usually caused by the GPU cluster setup and the NCCL version.
Personally, I also had some issues with multi-node training when I used their code.
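If the problem persists, a minimal first debugging step (a sketch, assuming PyTorch built with CUDA; the interface name below is a hypothetical example) is to confirm which NCCL version PyTorch was compiled against and to turn on NCCL's own logging before any process group is created:

import os

import torch

# Report the PyTorch build and the NCCL version it was compiled against.
print("torch:", torch.__version__)
print("nccl:", torch.cuda.nccl.version())

# Ask NCCL for verbose logs; the output usually names the rank/device
# that triggered the "invalid usage" failure during init.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT"
# os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # hypothetical; set to your cluster's network interface

In practice these variables would be exported in the batch script (or passed with mpirun -x NCCL_DEBUG=INFO) so that every rank picks them up.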