crypto/aes: add assembly for non-AES-NI machines
Opened this issue · 27 comments
alexbrainman commented
The golang https server described in https://golang.org/issue/4073?c=8, tested with

    siege --benchmark --concurrent=100 "https://localhost:8082"

gives

    Lifting the server siege... done.
    Transactions:            779 hits
    Availability:            100.00 %
    Elapsed time:            40.56 secs
    Data transferred:        0.02 MB
    Response time:           4.85 secs
    Transaction rate:        19.21 trans/sec
    Throughput:              0.00 MB/sec
    Concurrency:             93.11
    Successful transactions: 779
    Failed transactions:     0
    Longest transaction:     10.22
    Shortest transaction:    0.34

But nginx does better:

    Transactions:            5120 hits
    Availability:            100.00 %
    Elapsed time:            53.87 secs
    Data transferred:        0.74 MB
    Response time:           1.04 secs
    Transaction rate:        95.04 trans/sec
    Throughput:              0.01 MB/sec
    Concurrency:             98.92
    Successful transactions: 5120
    Failed transactions:     0
    Longest transaction:     1.08
    Shortest transaction:    0.15

hg id is 8e87cb8dca7d. windows/386.

The linux/386 golang server does about the same:

    Lifting the server siege... done.
    Transactions:            1867 hits
    Availability:            100.00 %
    Elapsed time:            118.75 secs
    Data transferred:        0.05 MB
    Response time:           6.23 secs
    Transaction rate:        15.72 trans/sec
    Throughput:              0.00 MB/sec
    Concurrency:             97.89
    Successful transactions: 1867
    Failed transactions:     0
    Longest transaction:     13.59
    Shortest transaction:    0.32

https://golang.org/issue/4073?c=6 claims similar results comparing to "hello world node.js app". I would investigate more, but I know nothing about SSL.

Alex
alexbrainman commented
Issue #4073 has been merged into this issue.
agl commented
nginx is probably using a less computationally intensive ciphersuite, and it's using OpenSSL rather than math/big. If anyone is keen on improving things, a constant time modexp in math/big would go a long way, but it's very complex. We could also likely reduce the cost of RSA blinding by squaring the blind, in the same manner as OpenSSL, rather than generating a new blind each time. Lastly, a high-speed, constant time P-256 would also help.
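To illustrate the blinding idea mentioned above, here is a minimal sketch in Go using only math/big. It is not the crypto/rsa or OpenSSL implementation. The `blinding` type, its methods, the toy key (p=61, q=53) and the fixed blind r=7 are all assumptions made for illustration; a real implementation would draw r at random and would need constant time arithmetic. The point is only that squaring both halves of the pair keeps it valid, so a refresh costs two modular multiplications instead of a new exponentiation and inverse.

```go
package main

import (
	"fmt"
	"math/big"
)

// blinding holds a blinding pair for an RSA modulus n:
// a = r^e mod n and ai = r^-1 mod n for some secret r coprime to n.
// (Hypothetical type, for illustration only.)
type blinding struct {
	a, ai, n *big.Int
}

// newBlinding pays for one modular exponentiation and one modular inverse.
// r must be coprime to n, otherwise ModInverse returns nil.
func newBlinding(r, e, n *big.Int) *blinding {
	return &blinding{
		a:  new(big.Int).Exp(r, e, n),
		ai: new(big.Int).ModInverse(r, n),
		n:  n,
	}
}

// update squares both halves. Since (r^e)^2 = (r^2)^e and
// (r^-1)^2 = (r^2)^-1, the pair stays valid for the new blind r^2.
func (b *blinding) update() {
	b.a.Mul(b.a, b.a).Mod(b.a, b.n)
	b.ai.Mul(b.ai, b.ai).Mod(b.ai, b.n)
}

// decrypt computes c^d mod n on a blinded input, unblinds the result,
// then refreshes the pair by squaring.
func (b *blinding) decrypt(c, d *big.Int) *big.Int {
	blinded := new(big.Int).Mul(c, b.a)
	blinded.Mod(blinded, b.n)
	m := new(big.Int).Exp(blinded, d, b.n) // equals c^d * r mod n
	m.Mul(m, b.ai).Mod(m, b.n)             // multiply by r^-1 to unblind
	b.update()
	return m
}

// checkBlinding decrypts the same toy ciphertext several times and reports
// whether every blinded result matches the plain c^d mod n.
func checkBlinding(rounds int) bool {
	// Textbook toy key (p=61, q=53); far too small for real use.
	n, e, d := big.NewInt(3233), big.NewInt(17), big.NewInt(2753)
	c := new(big.Int).Exp(big.NewInt(65), e, n)
	want := new(big.Int).Exp(c, d, n)
	b := newBlinding(big.NewInt(7), e, n)
	for i := 0; i < rounds; i++ {
		if b.decrypt(c, d).Cmp(want) != 0 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("blinded results match:", checkBlinding(5))
}
```

Each call to `decrypt` uses a different blind (r, r^2, r^4, ...), yet only the first one required `Exp` and `ModInverse` to set up.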
Labels changed: added priority-later, removed priority-triage.
Status changed to LongTerm.
rsc commented
robpike commented
gopherbot commented
I was running into this from another direction, downloading over https, and wrote a program that demonstrates the issue, not through lower footprint but increased CPU utilization. The program is on the playground here: http://play.golang.org/p/chCbgqS_ls

It is almost certainly possible to create a more straightforward example that passes data straight through the crypto functions without relying on the network, but I was trying to see if the issue was related to fetching multiple ssl streams at once or not and wanted to isolate this anyway.

A sample pprof topN output on a linux-64bit machine:

    Total: 953 samples
         472  49.5%  49.5%      472  49.5% crypto/aes.decryptBlockGo
         302  31.7%  81.2%      785  82.4% crypto/cipher.(*cbcDecrypter).CryptBlocks
          85   8.9%  90.1%       85   8.9% crypto/sha1.block
          36   3.8%  93.9%       36   3.8% runtime.memmove
           9   0.9%  94.9%      479  50.3% crypto/aes.decryptBlock
           5   0.5%  95.4%      483  50.7% crypto/aes.(*aesCipher).Decrypt
           4   0.4%  95.8%        4   0.4% runtime.futex
           3   0.3%  96.1%        3   0.3% ifaceeq1
           3   0.3%  96.4%        9   0.9% syscall.read
           2   0.2%  96.6%        2   0.2% crypto/hmac.(*hmac).tmpPad
           2   0.2%  96.9%        5   0.5% crypto/sha1.(*digest).Sum
           2   0.2%  97.1%        2   0.2% netpollblock

This doesn't appear to be limited to aes, as downloading from a different source (unfortunately an internal server) gave me this pprof output:

    Total: 1210 samples
         813  67.2%  67.2%      813  67.2% crypto/des.permuteBlock
         280  23.1%  90.3%     1016  84.0% crypto/des.feistel
          31   2.6%  92.9%       67   5.5% compress/flate.(*compressor).deflate
          21   1.7%  94.6%     1118  92.4% crypto/des.cryptBlock
           8   0.7%  95.3%     1125  93.0% crypto/cipher.(*cbcDecrypter).CryptBlocks
           6   0.5%  95.8%        6   0.5% compress/flate.(*compressor).findMatch
           6   0.5%  96.3%        6   0.5% encoding/binary.bigEndian.PutUint64
           5   0.4%  96.7%       19   1.6% compress/flate.(*huffmanEncoder).bitCounts
           5   0.4%  97.1%       10   0.8% runtime.mallocgc
           5   0.4%  97.5%        5   0.4% runtime.settype_flush

This was with `go version devel +98b396da54db Sun Apr 28 00:18:11 2013 +1000 linux/amd64`
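A network-free reproduction along the lines suggested above might look like the following sketch. It is not the playground program linked in the comment. The function name `cbcDecryptThroughput`, the all-zero key and IV, and the 16 MiB buffer size are assumptions chosen for illustration. It pushes a buffer through AES-128-CBC decryption, the path that dominates the first profile, and reports a rough throughput figure.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"fmt"
	"time"
)

// cbcDecryptThroughput encrypts size zero bytes with AES-128-CBC, times the
// decryption, and returns the rate in MiB/s plus whether the plaintext
// round-tripped. size must be a multiple of the AES block size.
func cbcDecryptThroughput(size int) (mibPerSec float64, ok bool, err error) {
	if size <= 0 || size%aes.BlockSize != 0 {
		return 0, false, fmt.Errorf("size %d is not a positive multiple of %d", size, aes.BlockSize)
	}
	// All-zero key and IV: acceptable for timing, never for real data.
	key := make([]byte, 16)
	iv := make([]byte, aes.BlockSize)
	block, err := aes.NewCipher(key)
	if err != nil {
		return 0, false, err
	}

	plain := make([]byte, size)
	ct := make([]byte, size)
	cipher.NewCBCEncrypter(block, iv).CryptBlocks(ct, plain)

	out := make([]byte, size)
	start := time.Now()
	cipher.NewCBCDecrypter(block, iv).CryptBlocks(out, ct)
	elapsed := time.Since(start)

	mib := float64(size) / (1 << 20)
	return mib / elapsed.Seconds(), bytes.Equal(out, plain), nil
}

func main() {
	rate, ok, err := cbcDecryptThroughput(16 << 20)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("AES-128-CBC decrypt: %.1f MiB/s (round trip ok: %v)\n", rate, ok)
}
```

A single timed pass is noisy; for comparable numbers the same loop could be wrapped in a `testing.B` benchmark and profiled with pprof.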
agl commented
jlmoiron: it's certainly true that the ciphers take CPU time. DES (in the second trace) is well known to be a terrible CPU hog. AES (the first trace) isn't quite so bad. Unfortunately your machine doesn't appear to have AES-NI support in the CPU so the AES code is pure Go and somewhat slow. I'd be happy to have optimised versions for various other CPUs but I'm afraid that we don't, yet.
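On Linux/x86, one way to check whether a machine advertises AES-NI is to look for the `aes` flag in /proc/cpuinfo. The sketch below assumes that file format; the helper name `hasCPUFlag` is hypothetical, and this is not how crypto/aes itself selects its implementation.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hasCPUFlag reports whether any "flags" line in cpuinfo-formatted text
// lists the given flag as a whole word.
func hasCPUFlag(cpuinfo, flag string) bool {
	for _, line := range strings.Split(cpuinfo, "\n") {
		name, value, found := strings.Cut(line, ":")
		if !found || strings.TrimSpace(name) != "flags" {
			continue
		}
		for _, f := range strings.Fields(value) {
			if f == flag {
				return true
			}
		}
	}
	return false
}

func main() {
	// /proc/cpuinfo exists only on Linux; elsewhere this reports the error.
	data, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		fmt.Println("cannot read /proc/cpuinfo:", err)
		return
	}
	fmt.Println("AES-NI advertised:", hasCPUFlag(string(data), "aes"))
}
```

Matching on whole fields avoids false positives from flags that merely contain the substring "aes".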
gopherbot commented
Is this fixed by https://code.google.com/p/go/source/detail?r=57503accfdc7 now?
alexbrainman commented
Yes, go is much faster than before:

go (go version devel +d2cb80eac1ac Sat Oct 05 14:15:02 2013 +1000 linux/386):

    Lifting the server siege... done.
    Transactions:            837 hits
    Availability:            100.00 %
    Elapsed time:            16.95 secs
    Data transferred:        0.02 MB
    Response time:           1.90 secs
    Transaction rate:        49.38 trans/sec
    Throughput:              0.00 MB/sec
    Concurrency:             94.06
    Successful transactions: 837
    Failed transactions:     0
    Longest transaction:     4.01
    Shortest transaction:    0.50

but still not as fast as nginx.

nginx:

    Lifting the server siege... done.
    Transactions:            1336 hits
    Availability:            100.00 %
    Elapsed time:            14.16 secs
    Data transferred:        0.19 MB
    Response time:           1.02 secs
    Transaction rate:        94.35 trans/sec
    Throughput:              0.01 MB/sec
    Concurrency:             96.06
    Successful transactions: 1336
    Failed transactions:     0
    Longest transaction:     1.07
    Shortest transaction:    0.12

Feel free to close this if you think there is nothing else we can do here.

Alex
minux commented
minux commented
alexbrainman commented
I don't know. Does this

    U:\>go tool pprof main.exe c:\tmp\a.pprof
    Welcome to pprof! For help, type 'help'.
    (pprof) top10
    Total: 6211 samples
        5461  87.9%  87.9%     5461  87.9% etext
         135   2.2%  90.1%      135   2.2% math/big.addMulVVW
         133   2.1%  92.2%      133   2.1% math/big.subVV
          98   1.6%  93.8%       98   1.6% math/big.mulAddVWW
          45   0.7%  94.5%      327   5.3% math/big.nat.divLarge
          43   0.7%  95.2%       43   0.7% addroots
          31   0.5%  95.7%       31   0.5% crypto/elliptic.p256ReduceDegree
          23   0.4%  96.1%       23   0.4% runtime.settype_flush
          21   0.3%  96.4%       21   0.3% markonly
          20   0.3%  96.8%       34   0.5% crypto/elliptic.p256Mul
    (pprof)

tell you anything?

Alex
minux commented
alexbrainman commented
minux commented
agl commented
rsc commented
rsc commented
rsc commented
gopherbot commented
I've submitted a patch for constant time modular exponentiation (https://golang.org/cl/94850043/), however it's slower than normal big.Int.Exp, so I don't think it actually addresses the performance issue here.
gopherbot commented
On Linux, it might be worth considering using an AF_ALG socket and interacting with the kernel's crypto library. Then, instead of writing your own assembly, you would get the benefit of automatic hardware acceleration if specialized instructions or chips exist.

Here's more sample code for how to interact with AF_ALG on Linux (for cipher and hash algorithms): http://src.carnivore.it/users/common/af_alg with blog post http://carnivore.it/2011/04/23/openssl_-_af_alg

Here's an example from Go for SHA1: https://github.com/jtolds/go-af-alg/blob/master/sha1/sha1_linux.go
gopherbot commented
taruti commented
gopherbot commented
cgo isn't required for AF_ALG with some minor syscall pkg changes (documented in that sha1_linux.go file I linked). And you're right, a context switch is way slower for small amounts of data. If the crypto package did end up using AF_ALG (my hope), I assume there'd be a threshold at which it is done natively in user space before it switches to syscalls.
marete commented
Not sure if this is specific to AES or if it is a CFB-mode issue, but: AES without AES-NI on a Nehalem machine in CFB mode gets one an astonishingly anaemic 16 Mbps (2 MB/s). This is especially significant because the openpgp standard (RFC 4880) requires all symmetric ciphers to be used in CFB mode. So, among other things, this makes the use of go.crypto/openpgp essentially impossible for large bulk encryptions when using AES 128 or 256 as the symmetric cipher.
yonderblue commented
I am running into this performance bottleneck (not tiny payloads) since the amd64 assembly is not present for ARM. Is this the proper ticket to prod? There was mention in the dev mailing list that arm64 AES-NI is coming for go1.7, how likely is that?
gopherbot commented
CL https://golang.org/cl/38366 mentions this issue.
yonderblue commented
Anyone planning on making progress on that CL linked above?