About the problem of a multi-node run getting stuck
AntyRia opened this issue · 1 comment
My setup: 2 machines with different IPs, each with 2 available GPUs.
I am using the multigpu_torchrun.py example and launching it with these two commands:
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=172.xx.1.150:29603 multi_node_torchrun.py 50 10
and
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=172.xx.1.150:29603 multi_node_torchrun.py 50 10
After starting, the program gets stuck at self.model = DDP(self.model, device_ids=[self.local_rank])
and makes no further progress. However, nvidia-smi
shows that processes on both machines have been created and are already occupying GPU memory. I wonder why.
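For context, here is a minimal sketch of the code around that line (written from memory to illustrate the structure of the official example; the ddp_setup function and Trainer class names are just illustrative). As I understand it, the DDP constructor broadcasts the initial parameters from rank 0, so it blocks until every rank in the job can communicate over NCCL:

import os
import torch
from torch.distributed import init_process_group
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_setup():
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE and the rendezvous env vars
    init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

class Trainer:
    def __init__(self, model: torch.nn.Module):
        self.local_rank = int(os.environ["LOCAL_RANK"])
        self.model = model.to(self.local_rank)
        # This is the line where the hang is observed: wrapping the model in
        # DDP synchronizes the initial state across all ranks, so it blocks
        # if the ranks cannot reach each other.
        self.model = DDP(self.model, device_ids=[self.local_rank])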
Looking through the issue history I found similar reports that were attributed to synchronization deadlocks, but I don't think that is the root cause here, since I am using the official example.
@AntyRia I have a similar issue. Were you able to solve the problem?