Connection refused error when running benchmark on multiple nodes
I ran the following command on one node:
./benchmark_cuda --size 2 --rank 0 --redis-host 10.251.209.33 --redis-port 6379 --prefix 0 --transport tcp --elements 100 --iteration-time 1s cuda_allreduce_ring_chunked
and this command on another node:
./benchmark_cuda --size 2 --rank 1 --redis-host 10.251.209.33 --redis-port 6379 --prefix 0 --transport tcp --elements 100 --iteration-time 1s cuda_allreduce_ring_chunked
and got this error:
[<some path>/gloo/gloo/transport/tcp/pair.cc:761] connect [127.0.1.1]:60817: Connection refused
I ran the benchmark on 2 GPUs on a single node the same way and it worked fine. What is wrong?
It looks like you're trying to use the loopback interface (localhost) instead of the external network interface. By default, Gloo will resolve the machine's hostname to find its external IP address. In your case, this must resolve to 127.0.1.1 instead of the external IP address. You can override this default behavior by specifying --tcp-device [INTERFACE NAME].
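
To confirm this diagnosis on each node, you can check what the machine's own hostname resolves to. The snippet below is a minimal sketch and not Gloo's code: it assumes the system resolver returns roughly what Gloo's default lookup would see, and the helper names is_loopback and resolve_own_hostname are made up for illustration.

```python
import ipaddress
import socket


def is_loopback(address: str) -> bool:
    """Return True if the address is in a loopback range (127.0.0.0/8 or ::1)."""
    return ipaddress.ip_address(address).is_loopback


def resolve_own_hostname() -> str:
    """Resolve this machine's hostname to an IPv4 address via the system resolver."""
    return socket.gethostbyname(socket.gethostname())


if __name__ == "__main__":
    hostname = socket.gethostname()
    try:
        address = resolve_own_hostname()
    except socket.gaierror as exc:
        # The hostname has no entry the resolver can find.
        print(f"could not resolve {hostname}: {exc}")
    else:
        if is_loopback(address):
            # Other nodes cannot reach a loopback address, so a peer that is
            # told to connect to it ends up connecting to itself instead.
            print(f"{hostname} resolves to loopback address {address}")
        else:
            print(f"{hostname} resolves to {address}")
```

If a node reports a loopback address such as 127.0.1.1, the cause is often an /etc/hosts entry that maps the hostname to that address. In that case either fix the mapping or pass --tcp-device with the name of the external interface on both nodes (for example --tcp-device eth0, where eth0 is only a placeholder for whatever your interface is called).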