Improve socket reuse to avoid using too many file descriptors
If I run slow_cooker with 500 clients, we can run out of file descriptors quickly. Running

```
slow_cooker -host "perf-cluster" -qps 20 -concurrency 500 -interval 10s http://proxy-test-4d:7474
```

results in a panic.
```
$ RUST_LOG=error RUST_BACKTRACE=yes ./linkerd-tcp-1490585634 example.yml
Listening on http://127.0.0.1:9989.
thread 'main' panicked at 'could not run proxies: Error { repr: Os { code: 24, message: "Too many open files" } }', /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libcore/result.rs:868
stack backtrace:
   1: 0x557763c6f7ac - std::sys::imp::backtrace::tracing::imp::write::hf33ae72d0baa11ed
        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:42
   2: 0x557763c72abe - std::panicking::default_hook::{{closure}}::h59672b733cc6a455
        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/panicking.rs:351
   3: 0x557763c726c4 - std::panicking::default_hook::h1670459d2f3f8843
        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/panicking.rs:367
   4: 0x557763c72f5b - std::panicking::rust_panic_with_hook::hcf0ddb069e7beee7
        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/panicking.rs:555
   5: 0x557763c72df4 - std::panicking::begin_panic::hd6eb68e27bdf6140
        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/panicking.rs:517
   6: 0x557763c72d19 - std::panicking::begin_panic_fmt::hfea5965948b877f8
        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/panicking.rs:501
   7: 0x557763c72ca7 - rust_begin_unwind
        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/panicking.rs:477
   8: 0x557763c9f34d - core::panicking::panic_fmt::hc0f6d7b2c300cdd9
        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libcore/panicking.rs:69
   9: 0x5577639d7642 - core::result::unwrap_failed::h52f3f53af574d319
  10: 0x5577639dcf41 - linkerd_tcp::main::h2f95da4c40bc36fe
  11: 0x557763c79f7a - __rust_maybe_catch_panic
        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libpanic_unwind/lib.rs:98
  12: 0x557763c736c6 - std::rt::lang_start::hd7c880a37a646e81
        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/panicking.rs:436
        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/panic.rs:361
        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/rt.rs:57
  13: 0x7fe85559a3f0 - __libc_start_main
  14: 0x5577639d5f68 - <unknown>
  15: 0x0 - <unknown>
```
Might want to play with SO_LINGER on socket shutdown to free up descriptors more quickly on socket close. I know Twisted uses something like this for TCPServers too. I've only got a D snippet handy, but I'm sure it's similar in Rust:
```d
// Set SO_LINGER to 1,0 which, by convention, causes a
// connection reset to be sent when close is called,
// instead of the standard FIN shutdown sequence.
int[2] option = [ 1, 0 ];
this.socket.handle.setsockopt(SOL_SOCKET, SO_LINGER, &option, option.sizeof);
```
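For reference, here's a rough Rust equivalent. This is only a sketch that goes through `libc` directly (std's `TcpStream` doesn't expose `SO_LINGER` on stable), and `set_linger_reset` is a name made up for illustration, not anything in this repo:

```rust
extern crate libc;

use std::io;
use std::mem;
use std::net::TcpStream;
use std::os::unix::io::AsRawFd;

// Hypothetical helper: set SO_LINGER to (on, 0s) so that close() sends an
// RST immediately instead of the normal FIN sequence, releasing the file
// descriptor without the socket lingering in TIME_WAIT.
fn set_linger_reset(stream: &TcpStream) -> io::Result<()> {
    let linger = libc::linger { l_onoff: 1, l_linger: 0 };
    let rc = unsafe {
        libc::setsockopt(
            stream.as_raw_fd(),
            libc::SOL_SOCKET,
            libc::SO_LINGER,
            &linger as *const libc::linger as *const libc::c_void,
            mem::size_of::<libc::linger>() as libc::socklen_t,
        )
    };
    if rc == 0 { Ok(()) } else { Err(io::Error::last_os_error()) }
}
```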
Sorry, I didn't clarify the use case. It should be used for sudden/abnormal connection loss.
This is a straight-up bug; socket options won't fix it. I think I know how to fix this...
## Problem
linkerd-tcp does not close destination connections when the source client closes a connection.
## Reproduction Case
Run linkerd-tcp:

```
lb=:; cargo run -- example.yaml
...
```
Run a web server:

```
web=:; twistd -n web -p 8880
...
2017-03-31 21:47:29+0000 [HTTPChannel,1014,127.0.0.1] 127.0.0.1 - - [31/Mar/2017:21:47:29 +0000] "GET / HTTP/1.1" 200 199 "-" "curl/7.43.0"
2017-03-31 21:47:29+0000 [-] Malformed file descriptor found. Preening lists.
2017-03-31 21:47:29+0000 [-] bad descriptor <HTTPChannel #1015 on 8880>
2017-03-31 21:47:30+0000 [-] Malformed file descriptor found. Preening lists.
2017-03-31 21:47:30+0000 [-] bad descriptor <HTTPChannel #1016 on 88
```
Monitor linkerd-tcp's connections:
```
netstat=:; while true ; do netstat -an |awk '$4 ~ /127\.0\.0\.1\.7474/ { print $4" "$6 }; $5 ~ /127\.0\.0\.1\.8880/ { print $5" "$6 }' | sort |uniq -c |sort -rn ; sleep 10 ;echo ; done
 944 127.0.0.1.8880 ESTABLISHED
 943 127.0.0.1.7474 CLOSE_WAIT
   1 127.0.0.1.7474 LISTEN
   1 127.0.0.1.7474 ESTABLISHED

 974 127.0.0.1.8880 ESTABLISHED
 973 127.0.0.1.7474 CLOSE_WAIT
   1 127.0.0.1.7474 LISTEN
   1 127.0.0.1.7474 ESTABLISHED

1003 127.0.0.1.8880 ESTABLISHED
1002 127.0.0.1.7474 CLOSE_WAIT
   1 127.0.0.1.7474 LISTEN
   1 127.0.0.1.7474 ESTABLISHED
```
Monitor linkerd-tcp's metrics:
```
metrics=:; while true ; do curl -s http://localhost:9989/metrics | grep conns_ | sort ; sleep 10 ; echo ; done
conns_active{proxy="default"} 944
conns_established{proxy="default"} 0
conns_pending{proxy="default"} 1

conns_active{proxy="default"} 974
conns_established{proxy="default"} 1
conns_pending{proxy="default"} 0
```
Then run a crappy client that doesn't tell the web server to tear down the connection:
```
crapclient=:; while true ; do curl -s "localhost:7474" >/dev/null && echo -n . ; done
......
```
Note that this behavior is not observed with slow_cooker.
We observe that the connections to linkerd-tcp are in `CLOSE_WAIT`, indicating that linkerd-tcp has not closed its half of the connection. Furthermore, linkerd-tcp has not attempted to close the connection to the destination either, as those connections are still `ESTABLISHED`.
## Solution
tokio_io's `AsyncWrite` provides a `shutdown`. We need to make sure that server-side shutdowns tear down the duplex stream.
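As a rough sketch of that shape (names here are mine, not actual linkerd-tcp code): once one direction of the copy hits EOF, chain a shutdown of its writer using tokio_io 0.1's `copy` and `shutdown` combinators:

```rust
extern crate futures;
extern crate tokio_io;

use futures::Future;
use std::io;
use tokio_io::{AsyncRead, AsyncWrite};

// Hypothetical helper: copy src -> dst until EOF, then invoke the writer's
// AsyncWrite::shutdown (via the tokio_io::io::shutdown combinator) so the
// write side is torn down when the read side finishes.
fn copy_then_shutdown<R, W>(src: R, dst: W) -> Box<Future<Item = u64, Error = io::Error>>
where
    R: AsyncRead + 'static,
    W: AsyncWrite + 'static,
{
    Box::new(
        tokio_io::io::copy(src, dst)
            .and_then(|(n, _src, dst)| tokio_io::io::shutdown(dst).map(move |_dst| n)),
    )
}
```

(As the next comment notes, this alone turns out not to be enough.)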
After further digging, I've learned that `AsyncWrite::shutdown` has no relationship to `TcpStream::shutdown`. (Surprise!) We need to use `TcpStream::shutdown` instead.
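Concretely, that means issuing a socket-level half-close when one side finishes. A minimal sketch against std's `TcpStream` (tokio-core's `TcpStream` exposes the same `shutdown(Shutdown)` signature); `propagate_half_close` is an illustrative name, not the actual fix:

```rust
use std::io;
use std::net::{Shutdown, TcpStream};

// Hypothetical helper: when one side of the proxied duplex hits EOF,
// propagate the close by shutting down the write half of the other socket.
// Unlike AsyncWrite::shutdown (which only flushes), TcpStream::shutdown is a
// real shutdown(2): it sends a FIN, so a connection stuck in CLOSE_WAIT can
// progress to CLOSED and its file descriptor can be reclaimed.
fn propagate_half_close(peer: &TcpStream) -> io::Result<()> {
    peer.shutdown(Shutdown::Write)
}
```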