Connection refused error when running benchmark on multiple nodes

Question

Closed this issue 5 years ago · 1 comments

I ran this command on one node:
./benchmark_cuda --size 2 --rank 0 --redis-host 10.251.209.33 --redis-port 6379 --prefix 0 --transport tcp --elements 100 --iteration-time 1s cuda_allreduce_ring_chunked
and this command on another node:
./benchmark_cuda --size 2 --rank 1 --redis-host 10.251.209.33 --redis-port 6379 --prefix 0 --transport tcp --elements 100 --iteration-time 1s cuda_allreduce_ring_chunked
and got this error:
[<some path>/gloo/gloo/transport/tcp/pair.cc:761] connect [127.0.1.1]:60817: Connection refused.

Running the benchmark the same way on 2 GPUs on a single node worked fine. What is wrong?

Answer 1 · 2019-10-21T06:34:14.000Z

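It looks like you're using the loopback interface (localhost) instead of the external network interface. The address in the error, 127.0.1.1, is a loopback address, so it is only reachable from the machine that advertised it. That is why the single-node run worked and the two-node run did not. By default, Gloo resolves the machine's hostname to find its external IP address. In your case, the hostname must resolve to 127.0.1.1 instead (on Debian and Ubuntu this commonly comes from a 127.0.1.1 entry for the hostname in /etc/hosts). You can override this default by passing --tcp-device [INTERFACE NAME] on each node, using the name of the external network interface.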
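
To confirm the diagnosis before changing any flags, you can check what each node's hostname resolves to. The Python sketch below only approximates the default lookup described above and is not Gloo's own code. The helper names `resolve_own_hostname` and `is_loopback` are made up for illustration.

```python
import ipaddress
import socket


def is_loopback(address: str) -> bool:
    """Return True if the address is in a loopback range (e.g. 127.0.0.0/8)."""
    return ipaddress.ip_address(address).is_loopback


def resolve_own_hostname() -> str:
    """Resolve this machine's hostname to an IPv4 address via the system resolver."""
    return socket.gethostbyname(socket.gethostname())


if __name__ == "__main__":
    try:
        addr = resolve_own_hostname()
    except OSError as exc:
        # The hostname may not resolve at all on some machines.
        print(f"could not resolve {socket.gethostname()}: {exc}")
    else:
        if is_loopback(addr):
            # Other nodes cannot connect to this address.
            print(f"{socket.gethostname()} -> {addr} (loopback; pass --tcp-device)")
        else:
            print(f"{socket.gethostname()} -> {addr}")
```

Run it on every node. If any node reports a loopback address, that node is the one whose peers will see "Connection refused".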