pytorch/examples

multi-node DDP

Tabatabaei1999 opened this issue · 0 comments

Hello, I want to run multi-node DDP across two machines. Both run Ubuntu 20.04 LTS and each has an environment with the same PyTorch version (2.0); one machine has a single GPU (NVIDIA RTX 3080) and the other also has a single GPU (NVIDIA RTX 3090). Following the PyTorch examples, I want to use NVIDIA NCCL as the backend (I did not install NCCL on the system separately, since I understand it is installed automatically with PyTorch). I took the multi-node DDP code from GitHub (the official PyTorch example: https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multinode.py).
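For reference, the core pattern that script follows (as far as I understand it) is roughly the sketch below; this is only a simplified outline, not the full example:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK
    # for every process; init_process_group picks them up via the default
    # env:// rendezvous.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model just to show the DDP wrapping; the real example
    # trains a larger model with a DistributedSampler-backed DataLoader.
    model = torch.nn.Linear(20, 1).to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```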
After that I connected the two machines with a LAN cable; they have the IPs 192.168.24.10 (netmask 255.255.255.0) and 192.168.24.20 (netmask 255.255.255.0). I pinged each machine from the other and that worked fine.
So I ran these commands.
On machine 0 (master):
torchrun \
    --nproc_per_node=1 --nnodes=2 --node_rank=0 \
    --master_addr=192.168.24.10 --master_port=1234 \
    multinode.py 50 10
On machine 1:
torchrun \
    --nproc_per_node=1 --nnodes=2 --node_rank=1 \
    --master_addr=192.168.24.10 --master_port=1234 \
    multinode.py 50 10
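As a sanity check, I thought the rendezvous itself could be tested without the full training script. The snippet below is only a sketch (check_rendezvous.py is a hypothetical helper, launched with the same torchrun commands as above but pointing at this file instead of multinode.py); if the two nodes can talk, every rank should print the same all-reduced sum:

```python
# check_rendezvous.py -- hypothetical helper for testing the two-node setup
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Each rank contributes its global rank; after the all_reduce (SUM) every
# rank should see the same value (0 + 1 = 1 for two single-GPU nodes).
t = torch.tensor([float(dist.get_rank())], device=f"cuda:{local_rank}")
dist.all_reduce(t)
print(f"rank {dist.get_rank()} / world {dist.get_world_size()}: sum = {t.item()}")

dist.destroy_process_group()
```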
After running these commands, nothing was logged in either terminal; there was no error or warning.
I thought this might be because the network connection wasn't working, but I checked the link with Wireshark and saw that after running these commands, TCP packets were being sent and received between 192.168.24.20 and 192.168.24.10.
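To get any output at all out of NCCL, my understanding is that its logging can be enabled through environment variables before the process group is first used; the snippet below is only a sketch of where I would set them (the interface name eno1 is a placeholder for whatever NIC actually carries the 192.168.24.x traffic):

```python
import os

# Ask NCCL to log its initialization and transport selection.
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Placeholder interface name; replace with the LAN interface on each machine.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eno1")

import torch.distributed as dist
dist.init_process_group(backend="nccl")
```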
Can you help me get multi-node DDP running?