tuning guide
vans163 opened this issue · 3 comments
Is there any chance of a tuning guide, specifically for SCP, such as how to saturate a 10GbE line on NVMe-backed storage?
That's a good question. One of the things we've been thinking about doing is choosing a new set of defaults for hpnssh in order to maximize throughput. However, we need to step back and first ensure that the TCP stack is properly tuned. The best guide for that is from ESnet: https://fasterdata.es.net/host-tuning/. Generally, for Linux it means significantly increasing your receive and send buffer sizes and using an appropriate congestion-control algorithm like BBR.
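The relevant knobs on Linux are sysctls. A minimal /etc/sysctl.d fragment in the spirit of the ESnet guide (the values here are illustrative starting points, not tuned for any particular path; apply with 'sysctl --system' or 'sysctl -p'):

```
# Allow TCP buffers to grow to 128MB (min / default / max, in bytes)
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
# BBR congestion control, with the fq qdisc it is usually paired with
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
```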
Don't assume that the buffer sizes listed there are the maximums you can use. I've found that on some paths increasing the buffers to 128MB or even 256MB made a huge difference with hpnssh.
SCP can be a bit slow because it essentially creates a pipe to an SSH process, and pipes can be a bottleneck. One way to test your pipe speed is to run "dd if=/dev/zero bs=1M count=10000 | cat > /dev/null" and compare that to "dd if=/dev/zero bs=1M count=10000 of=/dev/null". On my AMD Epyc system the first command gets 2 GB/s of throughput and the second gets 19 GB/s. That said, 2 GB/s is still above your target rate of 10 Gb/s, but it's worth checking.
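The two commands above can be run back to back; dd prints its throughput summary to stderr, so the numbers show up even when the data itself goes through the pipe (count reduced here so it finishes quickly):

```shell
# Throughput with a pipe in the path: dd's data flows into cat,
# which discards it; dd's summary still goes to stderr.
echo "through a pipe:"
dd if=/dev/zero bs=1M count=1000 | cat > /dev/null

# Throughput with no pipe: dd writes straight to /dev/null.
echo "direct:"
dd if=/dev/zero bs=1M count=1000 of=/dev/null
```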
The next thing you can do is use a different cipher. The default cipher is chacha20-poly1305, which in many cases will be slower than aes256-ctr (or aes128-ctr). So try different ciphers using the '-c [ciphername]' switch, e.g. 'hpnssh -caes256-ctr', which would use the threaded aes-ctr cipher. You can get a list of all the ciphers available on your system with 'hpnssh -Q cipher' and try them to see what works best for you.
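If you want a quick, network-free proxy for relative cipher cost on a given host, 'openssl speed' works. Note these are OpenSSL algorithm names, not SSH cipher names, and the absolute numbers won't match hpnssh throughput, since hpnssh adds MAC, packetization, and threading on top:

```shell
# Rough single-core throughput for the EVP implementation of each
# cipher; a faster algorithm here usually maps to a faster SSH cipher.
for alg in aes-128-ctr aes-256-ctr chacha20-poly1305; do
    openssl speed -evp "$alg" -seconds 1 2>/dev/null | tail -n 1
done
```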
Lastly, if you are running Ubuntu Jammy or are building against OpenSSL 3.0+, then the chacha20-poly1305 cipher will use a more advanced method of computing the poly1305 portion of the cipher.
In my testbed, two AMD Epycs connected via 10Gb through a local switch, I can get around 980 to 1050 MB/s (bytes, not bits) using aes128-ctr. If I use the None cipher switch (as well as the None MAC switch) I can usually hit 1100 to 1150 MB/s. You still have an authenticated session, but you lose the in-flight encryption. https://www.psc.edu/hpn-ssh-home/hpn-readme/ has more information about the None option.
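For reference, the None switches are usually set per host. A sketch of a client-side ~/.ssh/config entry, with option names as described in the PSC HPN-SSH README (treat the exact names as an assumption to verify against your hpnssh version; the server must also have NoneEnabled set):

```
# Hypothetical host alias; use only on trusted networks, since
# payload encryption is dropped after authentication completes.
Host fastbox
    NoneEnabled yes      # permit the None cipher post-auth
    NoneMacEnabled yes   # also drop the MAC (newer hpnssh releases)
    NoneSwitch yes       # actually switch this connection to None
```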
Looking into this now; I'll edit this as I go along. Quick preliminary notes:
dd if=/dev/zero bs=1M count=10000 > /dev/null
(no cat) - if I keep the blocksize at 4096 I get 2 GB/s; if I make it 1M or 4097 I get 1.4 GB/s.
- I get the same speed, 18.9 GB/s, using the non-pipe dd.
Yeah, can't get anywhere with that.
So enabling BBR is a nice win with iperf3 on a public 108 ms link.
default
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 7.14 MBytes 59.9 Mbits/sec 0 3.52 MBytes
[ 5] 1.00-2.00 sec 76.0 MBytes 637 Mbits/sec 73 6.57 MBytes
[ 5] 2.00-3.00 sec 61.9 MBytes 519 Mbits/sec 0 6.84 MBytes
[ 5] 3.00-4.00 sec 64.4 MBytes 540 Mbits/sec 0 7.08 MBytes
[ 5] 4.00-5.00 sec 66.7 MBytes 560 Mbits/sec 0 7.28 MBytes
[ 5] 5.00-6.00 sec 68.0 MBytes 571 Mbits/sec 0 7.44 MBytes
[ 5] 6.00-7.00 sec 69.2 MBytes 580 Mbits/sec 0 7.57 MBytes
[ 5] 7.00-8.00 sec 70.2 MBytes 589 Mbits/sec 0 7.67 MBytes
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 4.16 MBytes 34.9 Mbits/sec 0 2.01 MBytes
[ 5] 1.00-2.00 sec 98.9 MBytes 830 Mbits/sec 231 24.6 MBytes
[ 5] 2.00-3.00 sec 111 MBytes 930 Mbits/sec 89 24.1 MBytes
[ 5] 3.00-4.00 sec 111 MBytes 929 Mbits/sec 89 24.1 MBytes
[ 5] 4.00-5.00 sec 111 MBytes 930 Mbits/sec 91 24.1 MBytes
[ 5] 5.00-6.00 sec 111 MBytes 930 Mbits/sec 88 24.0 MBytes
But notice the extra retransmits; I'm guessing that's why distros don't enable it by default. Maybe it produces extra noise on the public internet?
Both client and server are using these sysctls:
net.ipv4.ip_unprivileged_port_start = 22
net.core.netdev_max_backlog = 307200
net.ipv4.tcp_rmem = 8192 262144 536870912
net.ipv4.tcp_wmem = 4096 16384 536870912
net.ipv4.tcp_adv_win_scale = -2
net.ipv4.tcp_notsent_lowat = 131072
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
scp and hpnscp both run at exactly the same speed. 'hpnscp -c aes256-ctr /tmp/3G test@test:/tmp'
produces reports like this (each second):
1s 4MB/s
2s 6MB/s
3s 7MB/s
4s 8MB/s
..
14s 17MB/s
15s 17.2MB/s
16s 17.4MB/s
17s 17.7MB/s
I assume it keeps growing the window, but very slowly, whereas iperf3 just blasts at line speed after the first second. Any thoughts?