metaopt/torchopt

[Issue Tracker] PyTorch distributed RPC

XuehaiPan opened this issue · 1 comment

This is an issue tracker for the upstream issues:

@XuehaiPan
I found the key point of the issue where calling init_rpc with a world size larger than N causes Resource temporarily unavailable.
Intuitively, the cause of this issue is the large number of connections initiated to rank 0 "simultaneously".
Based on this hypothesis, I added one simple line to PyTorch's distributed RPC code and reran my test code, and I got the correct result without any error.
My code is inserted here: after _init_rpc_states, I sleep for a while (the sleep time equals the rank). This means all workers connect to the leader one by one instead of all at the same time.
(screenshot of the patch)
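
For readers who cannot see the screenshot, the idea of the one-line workaround is roughly the following. This is only a minimal, self-contained sketch: _init_rpc_states here is a placeholder standing in for the PyTorch internal of the same name, and the exact call site inside torch.distributed.rpc may differ across versions.

```python
# Sketch of the workaround, not the actual PyTorch source.
import time

def _init_rpc_states(agent):
    """Placeholder for torch.distributed.rpc.api._init_rpc_states."""

def init_rpc_states_staggered(agent, rank):
    _init_rpc_states(agent)
    # The added line: each worker waits `rank` seconds before proceeding,
    # so the connections to rank 0 arrive one by one instead of all at once.
    time.sleep(rank)
```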

But why does the issue occur? I checked all my kernel TCP configuration and resource limits, as shown below. They all look right to me.

net.ipv4.ip_local_port_range = 10000 65535
net.ipv4.tcp_max_syn_backlog = 16384

net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 120

-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) unlimited
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes unlimited
-n: file descriptors 1048576
-l: locked-in-memory size (kbytes) unlimited
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 2061498
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 0
-N 15: unlimited
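
As a quick sanity check, the two limits that matter here can also be read from Python (Linux only), reading the sysctl value via /proc/sys and the file-descriptor limit via the resource module:

```python
# Quick sanity check of the limits listed above (Linux only).
import resource

with open("/proc/sys/net/ipv4/tcp_max_syn_backlog") as f:
    print("net.ipv4.tcp_max_syn_backlog =", f.read().strip())

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("file descriptors (soft/hard) =", soft, hard)
```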

Actually, the maximum number of simultaneous connections depends on two parts. One is the net.ipv4.tcp_max_syn_backlog configuration; my value of 16384 is more than enough for these simultaneous connections. The other is the backlog argument passed to listen(fd, backlog) in C++.
Therefore, I searched for listen calls in the tensorpipe code and found that backlog = 128 at every listen call, as follows (a small demonstration of the effect of a small backlog is sketched after the list):
ibv listener
shm listener
uv listener
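
To illustrate why a small, hard-coded backlog can matter when many peers dial the leader at once, here is a minimal Python sketch. It is not tensorpipe code, the numbers are arbitrary, and whether the excess clients time out or are refused depends on kernel settings such as net.ipv4.tcp_syncookies and net.ipv4.tcp_abort_on_overflow:

```python
# Minimal sketch: a listener with a tiny backlog that never calls accept(),
# mimicking a leader that is still busy while all peers connect at once.
import socket
import threading

BACKLOG = 4        # stands in for the hard-coded backlog=128 in tensorpipe
NUM_CLIENTS = 64   # stands in for the world size

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(BACKLOG)             # second argument is the listen backlog
addr = server.getsockname()

results, lock = [], threading.Lock()

def dial():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(2.0)
    try:
        s.connect(addr)
        outcome = "connected"
    except OSError as exc:
        outcome = f"failed: {exc}"
    finally:
        s.close()
    with lock:
        results.append(outcome)

threads = [threading.Thread(target=dial) for _ in range(NUM_CLIENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(r == "connected" for r in results), "of", NUM_CLIENTS, "clients connected")
```

On my understanding, once the accept queue backed by the backlog is full, further connection attempts are queued or dropped by the kernel, so some of the clients above will typically fail with a timeout; the exact counts vary per machine.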

I don't know whether my analysis is right, but after applying my patch above to the RPC code, my test code works every time.