rapier1/hpn-ssh

tuning guide

vans163 opened this issue · 3 comments

Is there any chance of a tuning guide, specifically for SCP - e.g., how to saturate a 10GbE line with NVMe-backed storage?

That's a good question. One of the things we've been thinking about doing is choosing a new set of defaults for hpnssh in order to maximize throughput. However, we need to step back and first ensure that the TCP stack is properly tuned. The best guide for that is from ESnet - https://fasterdata.es.net/host-tuning/. Generally, for Linux, it means significantly increasing your receive and send buffer sizes and using an appropriate congestion control algorithm like BBR.

Don't think that the buffer sizes listed there are the maximums you can use. I've found that on some paths increasing the buffer sizes to 128MB or even 256MB made a huge difference with hpnssh.
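For reference, those larger buffer limits can be written as a sysctl fragment like the one below. The exact values are illustrative (a 256MB maximum, as mentioned above); the ESnet guide explains how to size them for your actual path:

```
# /etc/sysctl.d/90-net-tuning.conf -- illustrative values, tune per the ESnet guide
# min / default / max socket buffer sizes in bytes (max raised to 256MB)
net.ipv4.tcp_rmem = 4096 87380 268435456
net.ipv4.tcp_wmem = 4096 65536 268435456
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
# BBR congestion control, paired with the fq qdisc
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
```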

SCP can be a bit slow because it essentially creates a pipe to an SSH process. Pipes can be a bottleneck - one way to test your pipe speed is to run "dd if=/dev/zero bs=1M count=10000 | cat > /dev/null" and then compare that to the pipeless "dd if=/dev/zero bs=1M count=10000 of=/dev/null". On my AMD Epyc system the first command gets 2 GB/s of throughput and the second gets 19 GB/s. That said, 2 GB/s is still above your target bit rate of 10Gb/s, so the pipe is unlikely to be your bottleneck here, but it's still worth checking.
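A quick way to script that comparison (count reduced here so it finishes fast; dd prints its throughput report on stderr, so it appears even when stdout goes through the pipe):

```shell
#!/bin/sh
# Through a pipe: throughput is limited by the kernel pipe buffer.
dd if=/dev/zero bs=1M count=1000 | cat > /dev/null

# No pipe: dd writes straight to /dev/null.
dd if=/dev/zero bs=1M count=1000 of=/dev/null
```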

The next thing you can do is use a different cipher. The default cipher is chacha20-poly1305. In many cases this will be slower than aes256-ctr (or aes128-ctr), so you should try different ciphers using the '-c [ciphername]' switch, e.g. 'hpnssh -caes256-ctr', which would use the threaded AES-CTR cipher. You can get a list of all the ciphers available on your system with 'hpnssh -Q cipher' and try them to see what works best for you.
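For example (the hostname and file path below are placeholders):

```shell
# List every cipher this hpnssh build supports
hpnssh -Q cipher

# Time the same transfer with a few different ciphers and compare
hpnscp -c aes256-ctr /tmp/bigfile user@remote:/tmp/
hpnscp -c aes128-ctr /tmp/bigfile user@remote:/tmp/
hpnscp -c chacha20-poly1305@openssh.com /tmp/bigfile user@remote:/tmp/
```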

Lastly, if you are running Ubuntu Jammy or are building against OpenSSL 3.0+, then the chacha20-poly1305 cipher will use a more advanced method of computing the poly1305 part of the cipher.

In my testbed, 2 AMD Epycs connected via 10Gb through a local switch, I can get around 980 to 1050 MB/s (bytes, not bits) using aes128-ctr. If I use the None cipher switch (as well as the None MAC switch) I can usually hit 1100 or 1150 MB/s. You still have an authenticated session but you lose the in flight encryption. https://www.psc.edu/hpn-ssh-home/hpn-readme/ has more information about the None option.
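For reference, the None option takes both a server-side setting and a client-side request. The option names below are my recollection of the HPN-SSH knobs; check the HPN README linked above for the authoritative list:

```
# Server side (sshd_config for the HPN server):
NoneEnabled yes
NoneMacEnabled yes

# Client side: request the None cipher/MAC (switched in after authentication)
hpnscp -oNoneEnabled=yes -oNoneSwitch=yes -oNoneMacEnabled=yes /tmp/bigfile user@remote:/tmp/
```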

Looking into this now; I'll edit this as I go along. Quick preliminary notes:

  • dd if=/dev/zero bs=1M count=10000 > /dev/null (no cat)
  • If I keep the block size at 4096 I get 2 GB/s; if I make it 1M or 4097 I get 1.4 GB/s.
  • I get the same speed, 18.9 GB/s, using the non-pipe dd.

Yeah, I can't get anywhere with that.

So enabling BBR makes a nice difference with iperf3 over a public 108 ms link.

default
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  7.14 MBytes  59.9 Mbits/sec    0   3.52 MBytes       
[  5]   1.00-2.00   sec  76.0 MBytes   637 Mbits/sec   73   6.57 MBytes       
[  5]   2.00-3.00   sec  61.9 MBytes   519 Mbits/sec    0   6.84 MBytes       
[  5]   3.00-4.00   sec  64.4 MBytes   540 Mbits/sec    0   7.08 MBytes       
[  5]   4.00-5.00   sec  66.7 MBytes   560 Mbits/sec    0   7.28 MBytes       
[  5]   5.00-6.00   sec  68.0 MBytes   571 Mbits/sec    0   7.44 MBytes       
[  5]   6.00-7.00   sec  69.2 MBytes   580 Mbits/sec    0   7.57 MBytes       
[  5]   7.00-8.00   sec  70.2 MBytes   589 Mbits/sec    0   7.67 MBytes 

net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.16 MBytes  34.9 Mbits/sec    0   2.01 MBytes       
[  5]   1.00-2.00   sec  98.9 MBytes   830 Mbits/sec  231   24.6 MBytes       
[  5]   2.00-3.00   sec   111 MBytes   930 Mbits/sec   89   24.1 MBytes       
[  5]   3.00-4.00   sec   111 MBytes   929 Mbits/sec   89   24.1 MBytes       
[  5]   4.00-5.00   sec   111 MBytes   930 Mbits/sec   91   24.1 MBytes       
[  5]   5.00-6.00   sec   111 MBytes   930 Mbits/sec   88   24.0 MBytes 

But notice the extra retransmits (the Retr column); I'm guessing that's why distros don't enable BBR by default. Maybe it produces extra congestion on the public internet?

Both client and server are using these sysctl settings:

net.ipv4.ip_unprivileged_port_start = 22
net.core.netdev_max_backlog = 307200

net.ipv4.tcp_rmem = 8192 262144 536870912
net.ipv4.tcp_wmem = 4096 16384 536870912
net.ipv4.tcp_adv_win_scale = -2
net.ipv4.tcp_notsent_lowat = 131072

net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
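One read-only way to confirm on each end that the settings took effect (no root needed):

```shell
# Which congestion control and qdisc are currently active
sysctl -n net.ipv4.tcp_congestion_control
sysctl -n net.core.default_qdisc

# Which congestion control algorithms this kernel can use at all
sysctl -n net.ipv4.tcp_available_congestion_control
```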

scp and hpnscp both run at exactly the same speed. Running "hpnscp -c aes256-ctr /tmp/3G test@test:/tmp"
produces per-second reports like this:

1s 4MB/s
2s 6MB/s
3s 7MB/s
4s 8MB/s
..
14s 17MB/s
15s 17.2MB/s
16s 17.4MB/s
17s 17.7MB/s

It keeps growing the window, I assume, but very slowly - iperf3 just blasts at line speed after the first second or two. Any thoughts?