rapier1/hpn-ssh

Results of paper "SSH Performance"

nh2 opened this issue · 10 comments

nh2 commented

I read http://allanjude.com/bsd/AsiaBSDCon2017_-_SSH_Performance.pdf and it lists some issues and suggested improvements to HPN.

Is it documented anywhere which of those were already merged / what their status is?

CC @allanjude

I think the main thing is just the interactive check, which should go directly into upstream.

I have rebased it here:

https://github.com/allanjude/openssh-portable/tree/openssh_interactive_window
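The gist of the interactive check is to keep the stock channel window for interactive (tty) sessions and only apply the large HPN-style window to bulk transfers. A minimal sketch of that kind of gate, with placeholder names rather than anything from the branch:

```c
/*
 * Illustrative sketch only, not the actual patch.  stock_window,
 * hpn_window, and is_interactive are placeholder names.
 */
static unsigned int
choose_window_size(int is_interactive, unsigned int stock_window,
    unsigned int hpn_window)
{
	/* Keep interactive (tty) sessions on the small stock window so
	 * keystroke traffic isn't queued behind megabytes of buffered data. */
	if (is_interactive)
		return stock_window;

	/* Bulk (non-tty) transfers get the large, BDP-sized HPN window. */
	return hpn_window;
}
```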

I have also rebased (but not even tried to compile test yet) the original work from 2017 here:

https://github.com/allanjude/openssh-portable/tree/bsdcan2017_rebase but have not had time to clean it up properly.

I haven't read this paper so thank you for pointing it out. As an aside, we received some new funding and are planning a series of improvements to HPN-SSH that should position it for a 10Gb world. Hopefully. Ideas are easy but engineering is hard :) The start of this has been delayed because of other demands on my time, but we hope to have some preliminary results by the end of the year.

You can get more information from:
https://www.psc.edu/hpn-ssh/community-guide

The paper might be a good start; it was tested using 40G NICs in place of 10G, since most of the limits in my testing were in the 7-15 Gbps range.

I think unifying the buffer sizes, and making them 32k instead of 16k, made the biggest difference in performance.
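To sketch what that unification amounts to (placeholder macro names, not the exact symbols in the tree): every layer that previously used its own 16k constant reads and writes through one shared 32k value, so data isn't split and re-copied between mismatched buffers.

```c
/*
 * Placeholder macros, for illustration only: the channel packet size,
 * the per-read() chunk, and the transport payload cap all share one
 * 32 KiB value instead of a mix of 16 KiB constants.
 */
#define UNIFIED_IOBUFSZ        (32 * 1024)

#define CHAN_PACKET_MAX        UNIFIED_IOBUFSZ  /* largest channel data packet  */
#define CHAN_READ_CHUNK        UNIFIED_IOBUFSZ  /* bytes pulled per read() call */
#define TRANSPORT_PAYLOAD_MAX  UNIFIED_IOBUFSZ  /* transport-level payload cap  */
```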

I had DTrace flame graphs showing that most of the time was being spent in memcpy() rather than elsewhere, so that helped.

@allanjude

I just rolled in some of the changes you had on your git repo, including the buffer changes and the NONE MAC. On my system it's clocking in at 150% faster than chacha20 and 30% faster than HPN-SSH with the None cipher. These are preliminary numbers, but that's a notable performance boost (1600 Mbps faster is nothing to sneeze at). I'm not sure about the options logic (as you can currently have a null MAC with a legitimate cipher), but that can be resolved.

I have a new branch on my github called aj-extensions if you want to take a look.

I need to do some more testing to compare the impact of the buffer changes but I wanted to tell you what I'm seeing with your work.

Chris

@rapier1
As an aside to myself - if we are doing NONE we can probably skip the rekeying after max_packets; rekeying a null cipher literally doesn't make any sense. It shouldn't make a big difference in throughput, but it's a useless operation.
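A rough sketch of what skipping it would look like, assuming a simple packet-count trigger (names here are illustrative, not the actual rekeying code):

```c
/*
 * Illustrative only: with the None cipher there is no key material to
 * wear out, so the packet-count rekey trigger can simply be skipped.
 */
static int
need_rekey(unsigned long long packets_sent, unsigned long long max_packets,
    int cipher_is_none)
{
	if (cipher_is_none)
		return 0;               /* nothing to protect, never rekey */

	return packets_sent > max_packets;
}
```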

@allanjude

I've done some more extensive testing on the changes you have proposed. I'm seeing some issues with using the 32k uniform buffers with a standard OpenSSH at higher bitrates, specifically when I am using the None cipher on its own or with the None MAC. I'm seeing a burst of really good throughput and then it bottoms out for seconds. I'm not entirely sure what's causing this, but my assumption is that the mismatch between the incoming datagrams and the receiver's buffers is causing it to drop packets all over the floor. Is there any way you could try to confirm this for me?

The buffer normalization really does help but if this is a real issue (and not just some madness on the part of my setup) I'm going to need to make this a negotiated size rather than a default.
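One way the negotiation could work, sketched under the assumption that each side advertises a preferred size and the session uses the smaller of the two (function name and floor value are illustrative):

```c
/*
 * Illustrative sketch: both peers advertise a preferred I/O buffer size
 * and the session settles on the smaller one, so a buffer-optimized side
 * never pushes 32k chunks at a stock 16k peer.
 */
static unsigned int
negotiate_bufsize(unsigned int local_pref, unsigned int peer_pref)
{
	unsigned int sz = (local_pref < peer_pref) ? local_pref : peer_pref;

	if (sz < 16 * 1024)     /* never drop below the stock 16 KiB */
		sz = 16 * 1024;

	return sz;
}
```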

The NoneMac is a win, though. I'm working on rolling that into a new 8.4 release. I'm going to have to ensure that it can only be used in the context of the NoneSwitch, however. For testing purposes it really helps me identify the overhead that the MAC imposes. One of the goals is to push the MAC processing onto a different pipeline, so this will let me know whether that's actually helping.
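The gating could be as simple as the check below; this is a sketch with made-up field names, not the actual HPN-SSH options structure:

```c
/*
 * Illustrative check: refuse the "none" MAC unless the None cipher
 * switch is also in effect, since a null MAC with a real cipher would
 * silently drop integrity protection.  Field names are placeholders.
 */
struct session_opts {
	int none_switch_enabled;   /* NoneSwitch negotiated and permitted */
	int none_mac_requested;    /* "none" selected in the MAC proposal */
};

static int
none_mac_allowed(const struct session_opts *opts)
{
	if (opts->none_mac_requested && !opts->none_switch_enabled)
		return 0;              /* reject: fall back to a real MAC */

	return 1;
}
```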

@rapier1,

More notes to myself. It's hard to actually disable rekeying entirely for the None cipher but I did decrease the frequency substantially. It's not making a huge difference in throughput (really within measurement error) but I'm rolling that out.

@g3ntry
My default is BBR for the test bed I have running. I've also tried this with HTCP and Cubic just for the sake of completeness. What I'm seeing looks like some sort of pausing during the first 2 to 4 GB of data transferred. During longer transfers this pause averages out, but during shorter transfers its impact is pretty clear.

For example, during a 100GB transfer I'm averaging 640MB/s. However, for a 3GB transfer the speeds range from 250MB/s to 500MB/s. I'm only seeing this when I'm sending to a buffer-optimized sshd from a non-optimized client. This very well could be an issue with my setup. I'm going to be conducting more tests using some different hosts, including some that I know will be resource-constrained. Hopefully I'll find out the problem is all on my end.

These are the results from a matrix of tests between different versions of hpnssh with the suggestions from @allanjude.
bufnone = buffer normalization with none mac
buftest = buffer normalization
hpnssh = base hpnssh with default cipher
hpnsshnone = base hpnssh with none cipher
nonemac = none mac (no buffer changes)
ssh = stock ssh
All values are in MB/s. Any entry with an 'x' indicates an incompatible set of options.
All values are the average of 40 iterations of a 15GiB 'dd if=/dev/zero' stream piped to /dev/null via ssh. I'll be doing more statistical analysis soon (standard deviation, mode, median, p-value, etc.).
Source is an Intel Xeon CPU X5675 @ 3.07GHz (6 cores, 12 threads)
Sink is an Intel Core i7-2600K CPU @ 3.40GHz (4 cores, 8 threads)
The test network is 10Gb DAC through a MikroTik 10Gb switch. 0.208ms avg RTT, 254KB BDP.
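(For reference, that BDP figure is just bandwidth × RTT: 10 Gb/s × 0.208 ms = 2.08 Mb ≈ 260 KB ≈ 254 KiB.)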

I will be rerunning the tests with the source and sink reversed. Later tests will also include increasing the RTT and using a 6-core ARM system (max throughput of ~6Gbps).

            bufnone   buftest   hpnssh    hpnsshnone  nonemac   ssh
bufnone     890.225   x         x         x           859.75    x
buftest     648.05    646.85    630.875   624.675     624.55    x
hpnssh      338       337.225   330.275   331.875     330.2     233.025
hpnsshnone  617.875   614.2     599.125   600.7       592.65    x
nonemac     842.825   x         x         x           789.85    x
ssh         334.75    336.6     328.9     329.025     327.65    230.3

So the results look good at this point. Obviously the None MAC makes a big difference. The buffer normalization is also making a difference - around 5% in this test bed. Assuming the other tests don't show any major issues, I'll be incorporating them into 8.4 sometime next week.