RuntimeError: ProcessGroupNCCL does not support gather
jeffhj opened this issue · 1 comments
jeffhj commented
Hi,
I am trying to train the models on multiple nodes of a Slurm cluster. However, I get "RuntimeError: ProcessGroupNCCL does not support gather" at line 95 of dist_utils.py.
Do you have any suggestions for handling this, e.g., replacing gather with all_gather here? Or is ProcessGroupNCCL not expected at this point? I am quite confused while debugging this; thanks for any help!
jeffhj commented
I got the training to run by changing this line to dist.gather(x, gather_list=tensor_list, dst=dst, group=slurm.get_gloo_group()).
So I guess the default group is ProcessGroupNCCL while ProcessGroupGloo is expected here, which means the group has to be passed explicitly?
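For anyone hitting the same error, here is a minimal sketch of the two options discussed above: passing a Gloo group to gather explicitly, and emulating gather with all_gather. It is not the repository's code. The helper names (gather_on_gloo, gather_via_all_gather, run_single_process_demo) are hypothetical, and creating the group with dist.new_group(backend="gloo") is only an assumption about what slurm.get_gloo_group() might return. Moving the tensor to CPU before the Gloo gather is also my own precaution, not something the fix above does.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def gather_on_gloo(x, dst=0, gloo_group=None):
    """Gather x from every rank onto dst through a Gloo group.

    Returns the list of gathered tensors on dst and None elsewhere.
    """
    x_cpu = x.cpu()  # assumption: run the Gloo gather on CPU tensors
    world_size = dist.get_world_size(group=gloo_group)
    # gather_list must only be supplied on the destination rank
    tensor_list = None
    if dist.get_rank() == dst:
        tensor_list = [torch.empty_like(x_cpu) for _ in range(world_size)]
    dist.gather(x_cpu, gather_list=tensor_list, dst=dst, group=gloo_group)
    return tensor_list


def gather_via_all_gather(x, dst=0, group=None):
    """Emulate gather with all_gather, which the NCCL backend supports.

    Every rank receives all tensors; only dst keeps the result.
    """
    world_size = dist.get_world_size(group=group)
    tensor_list = [torch.empty_like(x) for _ in range(world_size)]
    dist.all_gather(tensor_list, x, group=group)
    return tensor_list if dist.get_rank() == dst else None


def run_single_process_demo():
    """Run both helpers in one CPU process (world_size=1), for illustration only."""
    with tempfile.TemporaryDirectory() as tmp:
        store_path = os.path.join(tmp, "store")
        dist.init_process_group(
            backend="gloo",
            init_method=f"file://{store_path}",
            rank=0,
            world_size=1,
        )
        try:
            # new_group must be called by every process in the job
            gloo_group = dist.new_group(backend="gloo")
            x = torch.tensor([1.0, 2.0])
            via_gloo = gather_on_gloo(x, dst=0, gloo_group=gloo_group)
            via_all_gather = gather_via_all_gather(x, dst=0)
        finally:
            dist.destroy_process_group()
    return via_gloo, via_all_gather


if __name__ == "__main__":
    print(run_single_process_demo())
```

The all_gather variant sends every tensor to every rank, so it uses more memory and bandwidth than a true gather. The Gloo variant keeps gather semantics but goes through CPU tensors.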