rapier1/hpn-ssh

AES-CTR MT slower than vanilla

allanjude opened this issue · 13 comments

In my tests, the MT implementation of AES-CTR appears to be significantly slower than the implementation used in the current version of OpenSSH.

This is on a high end server:
Intel(R) Xeon(R) CPU E5-1650 v3, 6 cores @ 3.50GHz + HT (12 threads)

In a job that sends data from the server to a receiving client:
ssh -c aes128-ctr -m umac-64-etm@openssh.com user@host dd if=/dev/zero bs=128k | dd of=/dev/null bs=128k

MT-AES-CTR on both sides:
2000846848 bytes transferred in 10.001863 secs (200047414 bytes/sec)

MT-AES-CTR on client side only:
2884976640 bytes transferred in 10.050173 secs (287057417 bytes/sec)

MT-AES-CTR on server side only:
2868396032 bytes transferred in 10.002210 secs (286776221 bytes/sec)

MT-AES-CTR disabled on both sides:
5973835776 bytes transferred in 10.001882 secs (597271167 bytes/sec)

I tried recompiling with CIPHER_THREADS increased from 2 to 4. It makes it use more CPU, but throughput only goes up fractionally to around 300 MB/s.
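To make the four results above easier to compare, here is a minimal sketch that pulls the bytes/sec figure out of each dd summary line and expresses it relative to the run with MT-AES-CTR disabled on both sides. The names `parse_dd_rate`, `DD_SUMMARY`, and `runs` are made up for this illustration, and the regular expression assumes the FreeBSD dd summary format exactly as quoted in this report.

```python
import re

# Matches the dd summary line format quoted in this report (FreeBSD dd).
DD_SUMMARY = re.compile(
    r"(\d+) bytes transferred in ([\d.]+) secs \((\d+) bytes/sec\)"
)


def parse_dd_rate(line: str) -> int:
    """Return the bytes/sec figure from a dd summary line."""
    match = DD_SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a dd summary line: {line!r}")
    return int(match.group(3))


# Summary lines copied from the four runs reported above.
runs = {
    "MT on both sides": "2000846848 bytes transferred in 10.001863 secs (200047414 bytes/sec)",
    "MT on client side only": "2884976640 bytes transferred in 10.050173 secs (287057417 bytes/sec)",
    "MT on server side only": "2868396032 bytes transferred in 10.002210 secs (286776221 bytes/sec)",
    "MT disabled on both sides": "5973835776 bytes transferred in 10.001882 secs (597271167 bytes/sec)",
}

baseline = parse_dd_rate(runs["MT disabled on both sides"])
for label, line in runs.items():
    rate = parse_dd_rate(line)
    # Report MB/s and the fraction of the non-MT baseline.
    print(f"{label:26s} {rate / 1e6:7.1f} MB/s  {rate / baseline:5.2f} of baseline")
```

With these numbers, MT on both sides comes out at roughly a third of the baseline, and MT on one side only at a little under half.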

What version are you testing against, exactly?

All of the tests were done against:
OpenSSH_7.3p1-hpn14v12, OpenSSL 1.0.2j-freebsd 26 Sep 2016

Just with or without -odisablemtaes=yes on the client/server side.

But I get the same results with my git checkout on the client side, connecting to an entirely unpatched server:
OpenSSH_7.2p2, OpenSSL 1.0.2j-freebsd 26 Sep 2016

with -odisablemtaes=no
2874408960 bytes transferred in 10.001432 secs (287399754 bytes/sec)

with -odisablemtaes=yes:
5847531520 bytes transferred in 10.034207 secs (582759730 bytes/sec)

In this case it is a very high end system, Intel(R) Xeon(R) CPU E5-1650 v3, 6 cores @ 3.50GHz + HT (compared to an E5-2650 this machine has a much higher CPU frequency)

These tests can be done just over localhost to eliminate any network effects. But my tests were done over Chelsio T580 40gbps NICs connected back-to-back between a pair of these machines (made available to me by the FreeBSD Foundation for testing)

I did not use /dev/random because it would likely be the performance bottleneck.

I just ran:
ssh -c aes128-ctr -m umac-64-etm@openssh.com user@localhost dd if=/dev/zero bs=128k | dd of=/dev/null bs=128k

I varied it by adding '-odisablemtaes=yes' on the client, the server, and both.

I have been using timelimit(1) to make the measurements 30 seconds:
timelimit -q -t 30

I see the threading working, it just isn't faster than without the multi-threading:
  PID USERNAME THR PRI NICE   SIZE    RES STATE  C   TIME    WCPU COMMAND
84592 root       5  85    0 24720K  6080K CPU1   1   0:49 242.77% ssh
84595 test       5  85    0 27404K  6168K CPU7   7   0:49 240.59% sshd

I also tried underclocking my system to see what impact a slower clock speed (1200 MHz vs 3500 MHz) had:

(MT: C: ON, S: ON):
3500 MHz: 8771600384 bytes transferred in 30.001552 secs (292371553 bytes/sec)
1200 MHz: 3004858368 bytes transferred in 30.002547 secs (100153442 bytes/sec)

(MT: C: OFF, S: ON):
3500 MHz: 8785854464 bytes transferred in 30.001381 secs (292848338 bytes/sec)
1200 MHz: 2982739968 bytes transferred in 30.002141 secs (99417571 bytes/sec)

(MT: C: ON, S: OFF):
3500 MHz: 8786575360 bytes transferred in 30.002099 secs (292865359 bytes/sec)
1200 MHz: 3018129408 bytes transferred in 30.090607 secs (100301381 bytes/sec)

(MT: C: OFF, S: OFF):
3500 MHz: 20275757056 bytes transferred in 30.030299 secs (675176654 bytes/sec)
1200 MHz: 6847234048 bytes transferred in 30.078079 secs (227648646 bytes/sec)
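One way to read the underclocking numbers is to compare how much throughput dropped with how much the clock dropped. The sketch below does that arithmetic on the bytes/sec figures reported above. The names `CLOCK_RATIO`, `results`, and `scaling` are invented for this illustration.

```python
# Ratio of the two clock speeds used in the test (3500 vs 1200).
CLOCK_RATIO = 3500 / 1200

# (bytes/sec at the higher clock, bytes/sec at the lower clock),
# copied from the results above.
results = {
    "C: ON,  S: ON": (292371553, 100153442),
    "C: OFF, S: ON": (292848338, 99417571),
    "C: ON,  S: OFF": (292865359, 100301381),
    "C: OFF, S: OFF": (675176654, 227648646),
}


def scaling(fast: int, slow: int) -> float:
    """Throughput at the higher clock divided by throughput at the lower clock."""
    return fast / slow


print(f"clock ratio: {CLOCK_RATIO:.3f}")
for label, (fast, slow) in results.items():
    print(f"MT {label}: throughput ratio {scaling(fast, slow):.3f}")
```

All four throughput ratios land within about 2% of the clock ratio. That is what one would expect if the transfer is limited by CPU work in every configuration, though this is an inference from the numbers and not something measured directly.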

My test system is a single-socket Xeon E5-1650, but yes, I am not sure where along the way the threading stopped helping.

If it could be made to work again, I think it would be very helpful, as many machines have 8-32 cores now and are constrained by the performance of one core when trying to transfer data at 10 and 40 Gbps.

So I finally found out what is going on. Several years ago the AES cipher code was updated to use the AES New Instructions (AES-NI), and it looks like that significantly improved single-stream efficiency. Since AES-CTR-MT predates that code, it doesn't include the NI enhancements. The original author of the AES-CTR-MT code is taking a look at it now and we may be able to incorporate NI. I'll keep people informed.

As a note: It looks like AES-NI is dependent on a version of OpenSSL > 1.0 and requires that the AES-NI instruction set be present in the CPU itself (so Intel Westmere cores and later). As such, the performance bump is arch dependent.

Hmm, can we detect the presence of AES-NI dynamically? I know cpuid can tell you about AES in general ...

I'd assume there is a method for it. I'll need to look at the current AES implementation to see how they handle it. My gut feeling is that I/we should be able to update the AES-CTR-MT code to incorporate AES-NI once I figure that out.
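For reference, CPUID leaf 1 reports AES-NI support in bit 25 of ECX, so a runtime check is possible in principle. The sketch below only illustrates the check and is not how hpn-ssh or OpenSSL does it: it decodes that bit from a raw ECX value, and, as a convenience on Linux, looks for the `aes` flag in `/proc/cpuinfo`. The function names are hypothetical, and a real fix in the C code would presumably query CPUID directly or rely on the crypto library's own detection.

```python
from pathlib import Path

AES_ECX_BIT = 25  # CPUID leaf 1 reports AES-NI support in ECX bit 25


def aesni_from_cpuid_ecx(ecx: int) -> bool:
    """Decode the AES-NI bit from a raw CPUID leaf-1 ECX value."""
    return bool((ecx >> AES_ECX_BIT) & 1)


def aesni_from_cpuinfo(text: str) -> bool:
    """Look for the "aes" flag in Linux /proc/cpuinfo-style text."""
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return "aes" in value.split()
    return False


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux; typically absent on FreeBSD
    if cpuinfo.exists():
        print("AES-NI:", aesni_from_cpuinfo(cpuinfo.read_text()))
    else:
        print("no /proc/cpuinfo here; use CPUID or the platform's own tools")
```

The text-parsing helper is kept separate from the file read so it can be exercised on any platform with a sample string.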

As an update - I contacted the guy that did the original work on the AES-CTR-MT cipher a month or so ago. I just pinged him again to see if he had any thoughts on this.