proposal: bytes.Xor
kixelated opened this issue · 30 comments
Variation of #28113 (rejected proposal)
I would like to export the optimized xorBytes
implementation in crypto/cipher
. Unlike the above ticket, I think it should be a member of the bytes
package as bytes.Xor
.
Applying a XOR over multiple bytes is commonly used in crypto as well as parity generation. It is was relatively trivial to implement yourself but thanks to a recent contribution, the Go version is now very optimized.
package bytes
// Xor will xor each byte in a and b, writing the result to dst.
// The destination should have enough space, otherwise Xor will panic.
// Returns the number of bytes written, which will be the minimum of len(a), and len(b).
func Xor(dst, a, b []byte) int
Currently, xorBytes
will panic when dst
is not large enough. It could follow copy
semantics and write up to min(len(dst), len(a), len(b))
bytes. But it seems relatively common to panic in the crypto
package when the dst buffer is not large enough, such as XORKeyStream
.
My use-case is for a webrtc library.
I'm currently profiling the application and a sizable amount of time is spent in NewCTR
. The SRTP spec states that each packet (<1400 bytes) uses a new CTR cipher based on the header information (it's careful to not reuse the same IV).
This line is one of my current bottlenecks, as it generates a slice of size 512 on the heap each time. NewCTR
accounts for 2.16% of the program's execution time according to pprof.
My plan was to write my own CTR class, potentially abusing the fact that AES has a fixed block size to allocate on the stack. However, I'm not able to go down this route without lifting the fast xorBytes
implementation from crypto/cipher
. It's probably not a blocker; I just don't think it should be necessary to copy the assembly optimized functions that are used extensively in the stdlib but are not exposed.
For now, I'm evaluating a few alternative approaches but all of them involve Go changes.
Later in the project, I will have to implement FEC. Forward error correction is performed by XORing multiple packets together (in a specific way) to create parity packets. This is similar to RAID5, which is another example of when a fast bytes.Xor
function would be desirable.
Also worth mentioning, this is similar to the purpose of the math/bits package. Go developers should not have to resort to assembly implementations for efficiency if the function is common and can be well defined.
This might be a better fit for golang.org/x/crypto(?).
cc @FiloSottile
Btw, Filippo, I see that the amd64 version uses SSE2. Any interest in me adding an AVX2 version?
@josharian Possibly, but I was thinking crypto/cipher
would import bytes
instead of having two disjoint implementations.
I think the chance of adding this to bytes is vanishingly small.
To ask some obvious questions: what about And? Or? Andnot? All three of those (but not Xor) are used in sometimes hot code in the compiler (bv.go). And package bytes is not really a catchall home for things related to byte slices; it’s more a sibling package to strings, which makes it more focused on textual tasks.
Yeah, that's fair. Adding it somewhere like golang.org/x/crypto
would help my situation so long as you guys are okay with the code duplication.
What about adding it to crypto/cypher
directly?
For now, I'm evaluating a few alternative approaches but all of them involve Go changes.
For the AES-CTR use case I think changing the Go standard library implementation of the CTR cipher mode would be preferable to adding any new APIs. It shouldn't be too complicated to make the buffer lazily allocated, i.e. allocated only if extra state is needed across XORKeyStream
calls. Note also that you probably want to stick with the Go standard library implementation to pick up any future assembly optimizations (see #20967).
@mundaym that's what I thought of doing too, but crypto/aes
would need access to xorBytes in crypto/cipher
. It would have to be moved to an internal package or made public somehow.
@kixelated I think it would be reasonable to put it in an internal package, probably under crypto/internal
. That is definitely preferable to a new public API.
The implementation of xorBytes
should be constant time for crypto uses (i.e. the execution time should be not depend on the data) so it might perhaps make sense for it to live in crypto/internal/subtle
rather than adding another internal package. We could then consider moving it to the crypto/subtle
package if we wanted to make it public later.
Like Josh said, there's basically no chance of this being bytes.Xor.
One option would be for the compiler to recognize
for i := range d {
d[i] = a[i] ^ b[i]
}
but even that seems too specialized. Probably better would be to start with a package x/crypto/bits
or something like that.
/cc @FiloSottile to see if we want to retitle as an x/crypto proposal or else close.
If it looks good to the compiler folks, I'd rather have it via pattern-matching than make a name-conflicting bits package for a single function.
@FiloSottile does it need to be constant time (per #30553 (comment))? If so it seems like it belongs in subtle and definitely not in the compiler. (There’s nothing about the Go code that is necessarily constant time—if a compiler could prove that b is all zero, then it a no-op.)
Assuming that that is ok, I don’t personally object to the pattern matching. It’s easy enough to implement and maintain, and though it does contribute a little bit to binary bloat, it is nice to have obvious, clear Go code be fast.
P.S. Did you want an AVX2 version?
@FiloSottile does it need to be constant time (per #30553 (comment))? If so it seems like it belongs in subtle and definitely not in the compiler. (There’s nothing about the Go code that is necessarily constant time—if a compiler could prove that b is all zero, then it a no-op.)
Hmm, that's a good point. I don't actually see how the compiler would know anything about cipher streams, but this is looking for trouble.
Let's start it in golang.org/x/crypto/internal/subtle, and we can move it to crypto/subtle if it works out.
P.S. Did you want an AVX2 version?
I don't know, does it move the needle on benchmarks of higher level primitives, or lets me remove some assembly elsewhere? I don't really ever need XOR by itself, so if it's not taking significant time already, I'm ok with it as is.
But also, there’s no reason we can’t remove that CTR bottleneck for the common case, so let’s do that.
Yeah, I've been messing around with some aes.NewCTR
benchmarks. It requires moving the xorBytes code to an internal package but there are some improvements you can make with the fixed block size. Nothing major but it's a pretty popular cipher suite, so it's probably worth it.
I'll run some quick benchmarks with a specialized AES-CTR class. I also want to compare it against a hypothetical ctr.Reset(iv []byte)
function that would let me reuse objects, although afaik there shouldn't be too much difference compared to the specialized AES-CTR class if it is allocated on the stack.
Okay, I wrote some benchmarks for more information:
NewAESCTR
just creates a CTR cipher.AESCTRRTP
will create a NewCTR cipher and encrypt 1400 bytes (my use case).
The old code uses cipher.NewCTR
which allocates ctr
and out
on the heap. The new code uses aesCipherAsm.NewCTR
which reduces allocations and a lot of logic because of the fixed BlockSize
.
name old time/op new time/op delta
NewAESCTR-8 157ns ± 2% 113ns ± 1% -27.93% (p=0.000 n=10+10)
AESCTR1K-8 1.62µs ± 4% 1.56µs ± 1% -3.47% (p=0.000 n=10+10)
AESCTR8K-8 12.9µs ± 2% 12.5µs ± 2% -2.88% (p=0.000 n=8+10)
AESCTRRTP-8 2.56µs ± 2% 2.46µs ± 1% -3.97% (p=0.000 n=9+10)
name old alloc/op new alloc/op delta
NewAESCTR-8 608B ± 0% 576B ± 0% -5.26% (p=0.000 n=10+10)
AESCTR1K-8 0.00B 0.00B ~ (all equal)
AESCTR8K-8 0.00B 0.00B ~ (all equal)
AESCTRRTP-8 608B ± 0% 576B ± 0% -5.26% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
NewAESCTR-8 3.00 ± 0% 1.00 ± 0% -66.67% (p=0.000 n=10+10)
AESCTR1K-8 0.00 0.00 ~ (all equal)
AESCTR8K-8 0.00 0.00 ~ (all equal)
AESCTRRTP-8 3.00 ± 0% 1.00 ± 0% -66.67% (p=0.000 n=10+10)
name old speed new speed delta
AESCTR1K-8 629MB/s ± 4% 651MB/s ± 1% +3.56% (p=0.000 n=10+10)
AESCTR8K-8 636MB/s ± 2% 655MB/s ± 2% +2.96% (p=0.000 n=8+10)
AESCTRRTP-8 546MB/s ± 2% 569MB/s ± 1% +4.13% (p=0.000 n=9+10)
+3-4% speedup is nothing to scoff at. I'll mess around with a few more things, like seeing if we really need to encrypt the counter 512 bytes at a time.
Oh and here's comparing NewCTR
versus Reset
after encrypting 1400 bytes.
BenchmarkAESCTRRTP-8 500000 2454 ns/op 570.43 MB/s 576 B/op 1 allocs/op
BenchmarkAESCTRRTPReset-8 1000000 2334 ns/op 599.82 MB/s 0 B/op 0 allocs/op
Reset
just sets the counter to the given iv.
P.S. Did you want an AVX2 version?
I don't know, does it move the needle on benchmarks of higher level primitives, or lets me remove some assembly elsewhere?
It wouldn't let us remove assembly, because AVX2 is not available on all amd64 chips that Go supports. It would speed things up, but perhaps not enough to care. I ran this:
$ go test -bench=AES -cpuprofile=c.p -count=3 crypto/cipher
and then pprof and got:
(pprof) top10
Showing top 10 nodes out of 52
flat flat% sum% cum cum%
11.68s 18.99% 18.99% 11.68s 18.99% crypto/aes.encryptBlockAsm
10.24s 16.65% 35.64% 10.24s 16.65% crypto/aes.gcmAesDec
9.20s 14.96% 50.59% 9.20s 14.96% crypto/aes.gcmAesEnc
7.03s 11.43% 62.02% 7.03s 11.43% crypto/aes.gcmAesData
4.27s 6.94% 68.96% 12.09s 19.66% crypto/cipher.(*cfb).XORKeyStream
2.13s 3.46% 72.43% 2.13s 3.46% runtime.nanotime
2.10s 3.41% 75.84% 14.61s 23.75% crypto/aes.(*aesCipherAsm).Encrypt
1.82s 2.96% 78.80% 3.57s 5.80% crypto/cipher.xorBytes
1.75s 2.85% 81.65% 8.08s 13.14% crypto/cipher.(*ctr).refill
1.74s 2.83% 84.47% 1.74s 2.83% crypto/cipher.xorBytesSSE2
Assuming an ~2x speedup from SSE2 to AVX2, it looks like this would be a ~1.4% speed-up.
So I'm guessing the answer is no. But now I can stop wondering, and the rest of this conversation can resume. :)
How about adding a new package "math/bits/slices" that has the functions Xor, And, Or, Andnot, and maybe a few more that operate over byte slices, or why not, also over all sorts of uintX slices, and that may have platform dependent assembly language implementation for them? As @josharian indicated, the go compiler uses such functionality itself, and I can see how these functions would be useful for cryptography but also for image processing. So I'd say that these functions would be of general use.
This seems like a useful third-party package, but I don't yet see a clear reason to put it into the standard library. Perhaps it could go into the x repo at some point. That package can start with a copy of the existing code in crypto/cipher, which isn't all that much.
I would totally be fine with copying the fast xor code to golang.org/x/crypto
or similar. It's really up to the team to decide if it would be a burden to maintain the same code in two repositories. Could crypto/cipher
import golang.org/x/crypto/bytes
(or whatever the name should be)?
I've written a CTR package for my application with a Reset(iv []byte)
method. It's a 5% speedup when encrypting 1k size packets because I can reuse the cipher, avoiding allocating ~600 bytes on the heap for each NewCTR
call.
My package requires the fast XorBytes
function otherwise it's 40% slower. I can't ship with a copy of Go's assembly implementation because of licensing; plus it would be excessive.
I can't ship with a copy of Go's assembly implementation because of licensing; plus it would be excessive.
Go’s license is very permissive. I’m surprised to hear that that is an issue. And fwiw, I’d consider copy/pasting such code a pretty reasonable interim fix under the circumstances—huge win, not all that much code, low refactoring risk (just copy/paste), pretty stable source with lots of miles on it. Just my 2c.
I'm contributing to a project using the MIT license, so I assumed it wouldn't be compatible but you're right. I'll make a pull request and see what they think of a package containing the Go license plus files with the Go copyright.
I'll still need xorBytes
(or another copy of the assembly code) when I get around to implementing FEC (parity packets) but yeah this solution seems okay.
Seems like this is happening outside the bytes
package, so closing per discussion with @golang/proposal-review.
Where was this merged into after all?
@Kargakis I'm not sure I understand your question. This proposal was declined.
OK, thanks, I don't know what the status of that is.