TencentARC/T2I-Adapter

torch.nn.parallel.DistributedDataParallel hangs

Crd1140234468 opened this issue · 5 comments

I encountered a "torch.nn.parallel.DistributedDataParallel hangs" problem when I run `train_depth.py`. I found that the program never gets past the call to `dist._verify_model_across_ranks`:

[screenshot of the hang]

How can I solve this problem?
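A common cause of a hang at exactly this point (an assumption, not confirmed from the screenshot) is that each spawned process is not pinned to its own GPU, so the first NCCL collective that DDP runs during construction deadlocks. A minimal sketch of the usual setup, assuming the launcher exposes the rank via the `LOCAL_RANK` environment variable as recent `torch.distributed.launch`/`torchrun` do, and using a stand-in `torch.nn.Linear` in place of the real model:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Recent torch.distributed.launch / torchrun set LOCAL_RANK per process.
local_rank = int(os.environ["LOCAL_RANK"])

# Pin this process to its own GPU *before* creating the process group;
# if every rank defaults to cuda:0, the first NCCL collective that DDP
# runs during construction (the cross-rank model verification) can hang.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# Stand-in module; the real training script would build its model here.
model = torch.nn.Linear(10, 10).cuda(local_rank)

# Tell DDP explicitly which single device this replica lives on.
model = DDP(model, device_ids=[local_rank], output_device=local_rank)
```

Running with `NCCL_DEBUG=INFO` in the environment also makes NCCL print what each rank is doing, which helps locate where the ranks stall.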

(`dist._verify_model_across_ranks` is a function inside torch itself, not part of this repo.)

Also, here's the problem I'm having with multiple GPUs:

MC-E commented

What's the command you run?

> What's the command you run?

```bash
CUDA_VISIBLE_DEVICES=1,3 python -m torch.distributed.launch --nproc_per_node=2 --master_port 8888 test11.py --bsize=8
```
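As a side note, `torch.distributed.launch` passes a `--local_rank` argument to every copy of the script it spawns (newer versions also set the `LOCAL_RANK` environment variable), so `test11.py` must accept it or argument parsing fails before training even starts. A minimal sketch of the expected arguments, with `--bsize` included to match the command above:

```python
import argparse

parser = argparse.ArgumentParser()
# Injected by torch.distributed.launch: one value per spawned process.
parser.add_argument("--local_rank", type=int, default=0)
parser.add_argument("--bsize", type=int, default=8)
args = parser.parse_args()
```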

> What's the command you run?

Currently, `model_ad` can be wrapped in `torch.nn.parallel.DistributedDataParallel`, but when the model is the one loaded from `sd-v1-4.ckpt`, wrapping it in `torch.nn.parallel.DistributedDataParallel` gets stuck.
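One pattern that matches this symptom (an assumption, not confirmed from the thread) is every rank loading the large `sd-v1-4.ckpt` onto the same default GPU before DDP is constructed. A sketch of loading the checkpoint on CPU first and moving it to each rank's own device afterwards; the `instantiate_from_config` helper comes from the Stable Diffusion codebase this repo builds on, and the file paths are assumptions:

```python
import os

import torch
from omegaconf import OmegaConf
from torch.nn.parallel import DistributedDataParallel as DDP

from ldm.util import instantiate_from_config  # helper from the SD codebase

# Assumes torch.cuda.set_device / dist.init_process_group already ran,
# as in the sketch earlier in this thread.
local_rank = int(os.environ["LOCAL_RANK"])

# Build the model from its config, then load the checkpoint on CPU so
# the ranks do not all pile their weight copies onto one GPU.
config = OmegaConf.load("configs/stable-diffusion/v1-inference.yaml")  # assumed path
model = instantiate_from_config(config.model)

state = torch.load("models/sd-v1-4.ckpt", map_location="cpu")  # assumed path
model.load_state_dict(state["state_dict"], strict=False)

# Only now move to this rank's GPU and wrap in DDP.
model = model.cuda(local_rank)
model = DDP(model, device_ids=[local_rank], output_device=local_rank)
```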