[Issue Tracker] PyTorch distributed RPC
XuehaiPan opened this issue · 1 comments
This is an issue tracker for the upstream issues:
- Initialize RPC with large world size:
- Pass nn.Module and nn.Parameter as RPC argument:
@XuehaiPan
I found the key point of the issue where init_rpc with a world size larger than N fails with "Resource temporarily unavailable".
Intuitively, the cause is the large number of connections initiated to rank 0 "simultaneously".
Based on this hypothesis, I added one simple line of code to PyTorch's distributed RPC code and reran my test code. I got the correct result without any error.
My code is inserted here: after _init_rpc_states, each process sleeps for a while (the sleep time equals its rank). This makes the processes connect to the leader one by one instead of all at the same time.
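The staggering idea can be sketched outside of PyTorch as follows. This is only an illustration of the technique, not the actual patch: `staggered_call`, `connect`, `delay_per_rank`, and `sleep` are hypothetical names, and treating the per-rank delay as one second is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def staggered_call(
    rank: int,
    connect: Callable[[], T],
    delay_per_rank: float = 1.0,  # assumption: "sleep time equals rank" means seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Wait `rank * delay_per_rank` seconds, then run `connect`.

    Rank 0 does not wait; rank k waits k times the per-rank delay, so the
    processes reach the leader one after another instead of all at once.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    sleep(rank * delay_per_rank)
    return connect()


if __name__ == "__main__":
    # Stand-in for the real connection step: it only prints.
    for rank in range(3):
        staggered_call(
            rank,
            lambda r=rank: print(f"rank {r} connecting"),
            delay_per_rank=0.01,
        )
```

In a real job, `connect` could wrap the `torch.distributed.rpc.init_rpc` call. Note that this delays the whole call from the outside, whereas my patch sleeps inside PyTorch after _init_rpc_states, so the two may not behave identically. The total startup time also grows with the world size.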
But why does the issue occur? I checked all the TCP configuration in my kernel and my resource limits, shown below. It all looks right.
net.ipv4.ip_local_port_range = 10000 65535
net.ipv4.tcp_max_syn_backlog = 16384
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 120
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) unlimited
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes unlimited
-n: file descriptors 1048576
-l: locked-in-memory size (kbytes) unlimited
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 2061498
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 0
-N 15: unlimited
Actually, the maximum number of simultaneous connections depends on two things. One is the net.ipv4.tcp_max_syn_backlog setting; my value of net.ipv4.tcp_max_syn_backlog = 16384 is large enough for the simultaneous connections. The other is the backlog argument of listen(fd, backlog) called in C++.
Therefore, I searched for listen calls in the tensorpipe code and found that backlog = 128 at every listen call, as follows:
ibv listener
shm listener
uv listener
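To make the two limits concrete, here is a small sketch that opens a listener with backlog 128 and prints the related kernel settings. It is an illustration under assumptions: `read_sysctl` and `effective_backlog` are hypothetical helpers, and the sketch relies on the documented Linux behavior that the backlog passed to listen() is silently capped at net.core.somaxconn, a setting that is not in my list above.

```python
import socket
from pathlib import Path
from typing import Optional


def read_sysctl(name: str) -> Optional[int]:
    """Read an integer sysctl such as 'net.core.somaxconn' from /proc/sys.

    Returns None when the file is missing or unreadable (e.g. not Linux).
    """
    path = Path("/proc/sys") / name.replace(".", "/")
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def effective_backlog(requested: int, somaxconn: Optional[int]) -> int:
    """Accept-queue length after Linux caps listen()'s backlog at somaxconn."""
    if somaxconn is None:
        return requested
    return min(requested, somaxconn)


if __name__ == "__main__":
    requested = 128  # the value found at the tensorpipe listen calls
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(requested)
        print("listen backlog requested:", requested)
    print("tcp_max_syn_backlog:", read_sysctl("net.ipv4.tcp_max_syn_backlog"))
    print(
        "accept queue limit:",
        effective_backlog(requested, read_sysctl("net.core.somaxconn")),
    )
```

If this reading is right, raising net.ipv4.tcp_max_syn_backlog alone would not help, because the accept queue of each listener stays bounded by the 128 passed to listen().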
I don't know if my analysis is right, but after applying the patch described above to the RPC code, my test code works every time.