broker crashing, rv < 0: network is unreachable
etaoxing opened this issue · 3 comments
I'm running the IMPALA vtrace example on one machine with a single peer. When I set export BROKER_IP=$(echo $SSH_CONNECTION | cut -d' ' -f3), the moolib.broker process crashes at some point in the run (and the peer job also crashes):
$ python -m moolib.broker
Broker listening at 0.0.0.0:4431
terminate called after throwing an instance of 'std::runtime_error'
what(): In connectFromLoop at .../moolib/src/tensorpipe/tensorpipe/transport/uv/uv.h:313 "rv < 0: network is unreachable"
Aborted (core dumped)
However, I have not run into this issue yet when starting the peer using BROKER_IP=0.0.0.0.
Do things work when you use BROKER_IP=0.0.0.0? Or BROKER_IP=127.0.0.1?
This seems to be an error from tensorpipe when connecting to some kind of bad address. It would be interesting to see what echo $SSH_CONNECTION outputs for you, i.e. what it's setting BROKER_IP to.
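To see which value the cut -d' ' -f3 step picks out, here is a minimal Python sketch. It assumes SSH_CONNECTION has the usual four space-separated fields (client IP, client port, server IP, server port); the helper name broker_ip_from_ssh_connection is hypothetical and not part of moolib.

```python
import ipaddress


def broker_ip_from_ssh_connection(value: str) -> str:
    """Return the third field (the server IP) of an SSH_CONNECTION string.

    Hypothetical helper for illustration; mirrors `cut -d' ' -f3`.
    """
    fields = value.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields in SSH_CONNECTION, got {len(fields)}")
    server_ip = fields[2]
    # Raises ValueError if the field is not a valid IPv4 or IPv6 address.
    ipaddress.ip_address(server_ip)
    return server_ip
```

Calling it on os.environ["SSH_CONNECTION"] inside the SSH session shows the address the peer would try to reach. Note that a syntactically valid address can still be unroutable from the machine itself, which would be consistent with the "network is unreachable" error.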
Yes, I've let Atari train for 8+ hours using BROKER_IP=0.0.0.0, and things seem to be working. I don't have the capacity to try more than one peer across multiple machines, but I've run multiple peers locally and that seems to work too.
echo $SSH_CONNECTION gives xx.xx.xxx.x 5xxxx 128.x.xxx.xxx 22, where the second address is the IP of the server.
Should this be considered a tensorpipe issue then? It would be nice to catch this exception and then stall, instead of crashing the run.
Thanks, I will try to investigate.
The SSH_CONNECTION trick seems to just not work in your case, but it might work if you manually input the IP of the broker.
Either way, you're right that this error should be caught and not result in a fatal error.
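Until the error is handled inside the library, one possible workaround is to check that the broker address is reachable before starting the peer. The sketch below is an assumption-laden illustration, not moolib's API: wait_until_reachable is a hypothetical helper that uses only the standard socket module, and it does not catch the C++ exception thrown from tensorpipe. Port 4431 is the one shown in the broker log above. It also assumes the broker tolerates a plain TCP connection that is opened and closed immediately.

```python
import socket
import time


def wait_until_reachable(host: str, port: int, attempts: int = 5, delay: float = 1.0) -> bool:
    """Try to open a TCP connection, retrying instead of failing on the first error.

    Hypothetical pre-flight check; returns True once a connection succeeds,
    False if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            # OSError covers "network is unreachable", refused connections,
            # timeouts, and name-resolution failures.
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError as exc:
            print(f"attempt {attempt + 1}/{attempts}: {host}:{port} not reachable ({exc})")
            if attempt + 1 < attempts:
                time.sleep(delay)
    return False
```

For example, a launcher script could call wait_until_reachable(os.environ["BROKER_IP"], 4431) and only start the peer when it returns True, so a bad BROKER_IP shows up as a readable message rather than an abort.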