XPixelGroup/HAT

RuntimeError: NCCL invalid usage

saqib736 opened this issue · 1 comment

Dear HAT team,
Thank you for your amazing work.
I am trying to train a simple HATx4 model from scratch on my own dataset. I set up a conda environment and installed all the dependencies and packages according to the instructions, but when I try to train, I get the following error:

"RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1631630815121/work/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc.)."

Please let me know what I am doing wrong or whether I need to change anything. I am training the model on a PC with four Tesla GPUs.

Looking forward to your response.

This issue is solved. I had to change nproc_per_node from 8 to 4, to match the four GPUs in my machine.
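
For anyone hitting the same error: a likely cause is that launching more processes than there are GPUs puts more than one rank on the same device, which NCCL rejects. Below is a minimal sketch of a check you could run before launching, assuming PyTorch is installed in the environment. The helper names `validate_nproc` and `visible_gpu_count` are hypothetical and not part of HAT or BasicSR; only `torch.cuda.device_count()` is a real PyTorch call.

```python
def validate_nproc(requested: int, available: int) -> int:
    """Return a process count that does not exceed the number of GPUs.

    Hypothetical helper, not part of HAT: it only clamps the requested
    value for --nproc_per_node to the number of visible devices.
    """
    if requested < 1:
        raise ValueError("requested process count must be at least 1")
    if available < 1:
        raise ValueError("no CUDA devices are visible")
    return min(requested, available)


def visible_gpu_count() -> int:
    """Return the number of CUDA devices PyTorch can see (0 if torch is missing)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


if __name__ == "__main__":
    available = visible_gpu_count()
    print(f"visible GPUs: {available}")
    if available:
        # 8 is the value that failed in this issue.
        print(f"use --nproc_per_node={validate_nproc(8, available)}")
```

On a machine with four GPUs, `validate_nproc(8, 4)` returns 4, which is the value that fixed the error above. Note that `CUDA_VISIBLE_DEVICES` also limits what `torch.cuda.device_count()` reports, so the check reflects the devices actually available to the training processes.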