About the problem of a multi-node run getting stuck
AntyRia opened this issue · 1 comment
My setup: 2 machines with different IPs, each with 2 available GPUs.
I am using the multigpu_torchrun.py example and launching it with these two commands:
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=172.xx.1.150:29603 multi_node_torchrun.py 50 10
and
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=172.xx.1.150:29603 multi_node_torchrun.py 50 10
After starting, the program gets stuck at self.model = DDP(self.model, device_ids=[self.local_rank])
and makes no further progress. However, nvidia-smi
shows that processes on both machines have been created and are already occupying GPU memory. I wonder why.
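For context, here is a minimal sketch of the code around that line (written from memory to illustrate the structure of the official example; the ddp_setup function and Trainer class names are just illustrative). As I understand it, the DDP constructor broadcasts the initial parameters from rank 0, so it blocks until every rank in the job can communicate over NCCL:

import os
import torch
from torch.distributed import init_process_group
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_setup():
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE and the rendezvous env vars
    init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

class Trainer:
    def __init__(self, model: torch.nn.Module):
        self.local_rank = int(os.environ["LOCAL_RANK"])
        self.model = model.to(self.local_rank)
        # This is the line where the hang is observed: wrapping the model in
        # DDP synchronizes the initial state across all ranks, so it blocks
        # if the ranks cannot reach each other.
        self.model = DDP(self.model, device_ids=[self.local_rank])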
Looking through the issue history I found similar reports that were attributed to synchronization deadlocks, but I don't think that is the root cause here, since I am using the official example.
@AntyRia I have a similar issue. Were you able to solve the problem?