facebookresearch/atlas

RuntimeError: ProcessGroupNCCL does not support gather

jeffhj opened this issue · 1 comment

Hi,

I am trying to train the models on multiple nodes on a Slurm cluster. However, I get "RuntimeError: ProcessGroupNCCL does not support gather" at line 95 of dist_utils.py.

Do you have any suggestions for handling this issue, e.g., replacing gather with all_gather here? Or is ProcessGroupNCCL not expected here? I am quite confused while debugging this. Thanks for any help!

Training runs after changing that line (line 95 of dist_utils.py) to dist.gather(x, gather_list=tensor_list, dst=dst, group=slurm.get_gloo_group()).

So I guess the default group is ProcessGroupNCCL, while ProcessGroupGloo is expected here, so we need to pass the group explicitly?
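
For anyone hitting the same error, here is a minimal sketch of the workaround described above: routing gather through an explicitly created Gloo group instead of the default group. It is not the repository's code. The helper name gather_on_gloo and the single-process demo are hypothetical, and moving the tensor to CPU before the call is my own assumption (Gloo collectives are normally run on CPU tensors). In the real training job the group would come from slurm.get_gloo_group() as in the fix above.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def gather_on_gloo(x, dst=0, group=None):
    """Gather `x` from every rank onto rank `dst` through `group`.

    Hypothetical helper: returns the list of gathered tensors on `dst`
    and None on every other rank.
    """
    # Assumption: the Gloo group is used with CPU tensors, so move x first.
    x_cpu = x.detach().cpu()
    world_size = dist.get_world_size(group=group)
    is_dst = dist.get_rank() == dst
    # Only the destination rank provides receive buffers.
    tensor_list = (
        [torch.empty_like(x_cpu) for _ in range(world_size)] if is_dst else None
    )
    dist.gather(x_cpu, gather_list=tensor_list, dst=dst, group=group)
    return tensor_list


def demo():
    """Single-process demo (world_size=1) so the sketch runs without a cluster."""
    init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=0,
        world_size=1,
    )
    try:
        # In a multi-node job the default backend would typically be NCCL;
        # a separate Gloo group is created explicitly for the gather call.
        gloo_group = dist.new_group(backend="gloo")
        return gather_on_gloo(torch.tensor([1.0, 2.0]), dst=0, group=gloo_group)
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(demo())
```

Passing group explicitly keeps the rest of training on the default group and sends only this one collective through Gloo, which matches the one-line change reported above.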