HankYe/PAGCP

DDP training throws an error; single-GPU training works fine


Hello, thank you for your attention to our work. This appears to be a communication issue between the GPUs in the cluster, which may be caused by network latency or load imbalance across GPUs. Could you check your CUDA and torch versions? Upgrading them may be one solution. Another way to work around the problem is to increase the timeout limit NCCL_TIMEOUT_MS, as sketched below.
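
For reference, a minimal sketch of raising the collective timeout when initializing the process group in PyTorch. The helper name `setup_ddp` and the `LOCAL_RANK` lookup assume a torchrun-style launch, and the exact NCCL environment variables honored (e.g. NCCL_TIMEOUT_MS above) depend on the torch/NCCL version, so treat this as an illustration rather than the repo's own launch code:

```python
# Sketch only: assumes torchrun-style launch (RANK/WORLD_SIZE/LOCAL_RANK set) and the NCCL backend.
import datetime
import os

import torch
import torch.distributed as dist


def setup_ddp(timeout_minutes: int = 120) -> None:
    """Hypothetical helper: init the process group with a longer collective timeout."""
    # The `timeout` argument controls how long collectives may block before the group
    # reports a failure (default is 30 minutes); a larger value tolerates slow
    # inter-GPU communication or temporary load imbalance between ranks.
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(minutes=timeout_minutes),
    )
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```

If the hang persists after raising the timeout, it is worth confirming that all ranks can actually reach each other (same NCCL version, no firewall between nodes) before tuning further.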