metaopt/torchopt

[Issue Tracker] PyTorch distributed RPC

XuehaiPan opened this issue · 1 comment

This is an issue tracker for the upstream issues:

@XuehaiPan
I found the key point of the issue where calling init_rpc with a world size larger than N causes Resource temporarily unavailable.
Intuitively, the cause of this issue is the large number of connections initiated to rank 0 "simultaneously".
Based on this hypothesis, I added one simple line to PyTorch's distributed RPC code and reran my test code, and I got the correct result without any error.
My code is inserted here: after _init_rpc_states, I sleep for a while (the sleep time equals the rank). This means all workers connect to the leader one by one instead of all at the same time.
(screenshot of the patch)
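
For readers who cannot see the screenshot, the idea of the one-line workaround is roughly the following. This is only a minimal, self-contained sketch: _init_rpc_states here is a placeholder standing in for the PyTorch internal of the same name, and the exact call site inside torch.distributed.rpc may differ across versions.

```python
# Sketch of the workaround, not the actual PyTorch source.
import time

def _init_rpc_states(agent):
    """Placeholder for torch.distributed.rpc.api._init_rpc_states."""

def init_rpc_states_staggered(agent, rank):
    _init_rpc_states(agent)
    # The added line: each worker waits `rank` seconds before proceeding,
    # so the connections to rank 0 arrive one by one instead of all at once.
    time.sleep(rank)
```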

But why does the issue occur? I checked all my kernel TCP configuration and resource limits, as shown below. They all look right to me.

net.ipv4.ip_local_port_range = 10000 65535
net.ipv4.tcp_max_syn_backlog = 16384

net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 120

-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) unlimited
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes unlimited
-n: file descriptors 1048576
-l: locked-in-memory size (kbytes) unlimited
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 2061498
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 0
-N 15: unlimited
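
As a quick sanity check, the two limits that matter here can also be read from Python (Linux only), reading the sysctl value via /proc/sys and the file-descriptor limit via the resource module:

```python
# Quick sanity check of the limits listed above (Linux only).
import resource

with open("/proc/sys/net/ipv4/tcp_max_syn_backlog") as f:
    print("net.ipv4.tcp_max_syn_backlog =", f.read().strip())

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("file descriptors (soft/hard) =", soft, hard)
```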

Actually, the maximum number of simultaneous connections depends on two parts. One is the net.ipv4.tcp_max_syn_backlog configuration; my value of 16384 is more than enough for these simultaneous connections. The other is the backlog argument passed to listen(fd, backlog) in C++.
Therefore, I searched for listen calls in the tensorpipe code and found that backlog = 128 at every listen call, as follows (a small demonstration of the effect of a small backlog is sketched after the list):
ibv listener
shm listener
uv listener
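
To illustrate why a small, hard-coded backlog can matter when many peers dial the leader at once, here is a minimal Python sketch. It is not tensorpipe code, the numbers are arbitrary, and whether the excess clients time out or are refused depends on kernel settings such as net.ipv4.tcp_syncookies and net.ipv4.tcp_abort_on_overflow:

```python
# Minimal sketch: a listener with a tiny backlog that never calls accept(),
# mimicking a leader that is still busy while all peers connect at once.
import socket
import threading

BACKLOG = 4        # stands in for the hard-coded backlog=128 in tensorpipe
NUM_CLIENTS = 64   # stands in for the world size

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(BACKLOG)             # second argument is the listen backlog
addr = server.getsockname()

results, lock = [], threading.Lock()

def dial():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(2.0)
    try:
        s.connect(addr)
        outcome = "connected"
    except OSError as exc:
        outcome = f"failed: {exc}"
    finally:
        s.close()
    with lock:
        results.append(outcome)

threads = [threading.Thread(target=dial) for _ in range(NUM_CLIENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(r == "connected" for r in results), "of", NUM_CLIENTS, "clients connected")
```

On my understanding, once the accept queue backed by the backlog is full, further connection attempts are queued or dropped by the kernel, so some of the clients above will typically fail with a timeout; the exact counts vary per machine.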

I don't know whether my analysis is right, but after applying my patch above to the RPC code, my test code works every time.