AVX for 1KB input
This sentence caught my interest as I need to work with exactly 1KB input :-):
There is no assembly routine for single-block compressions.
Is there any specific reason?
Basically: you can use SIMD for a hash function in two ways. Traditionally, you use it to parallelize the operations within a block. BLAKE3 is different: because it's a tree hash, you can instead parallelize across leaves of the tree. The latter is much more "embarrassingly parallel" than the former: it scales directly with the number of SIMD registers, and it's simpler to implement.
Furthermore, for small messages, you start having to worry about the overhead of warming up "cold" SIMD registers, and (in Go specifically) the overhead of calling into assembly instead of a potentially-inlineable function. So you get much more bang for your buck by parallelizing across leaves. It is certainly possible to parallelize within a block, and it would definitely speed things up for certain workloads; but it's unclear how much faster it would be, and that makes it hard for me to justify spending too much time and energy on it.
If you're curious, here's what the asm for intra-block SIMD looks like in (slightly modified) BLAKE2b: https://github.com/lukechampine/us/blob/master/merkle/blake2b/gen.go
Thanks for the valuable input. The 1KB case is so critical to me (I need to compute millions of these hashes all the time) that I might decide to invest my time in doing it, despite having zero experience in SIMD programming (but it might be yet another cool thing to learn). Could you clarify a couple of things?
My scenario is that I have thousands of 1KB chunks spread across the memory of my program (it's a merkle tree), and I need to compute a separate blake3 checksum for each of them. As you mentioned, I see two possibilities:
a. I may use AVX to parallelize the blocks within a single chunk.
b. I may use AVX to compute hashes for many chunks at a time.
Both approaches are fine for me; obviously I prefer the one that allows me to compute more hashes per second.
You said that AVX is better used to process 1KB chunks rather than 64-byte blocks. Are there any issues to consider with this approach?
As I understand it, by using AVX-512 instructions I could process up to 16 chunks in a single shot?
You said that some init step is required. Since I compute these hashes all the time, may I reserve the registers and initialize them once for the entire lifetime of my app?
I encourage you to try it! Writing asm is much easier in Go nowadays thanks to Avo. I used Felix Cloutier's instruction reference to help understand what each of the SIMD instructions does.
I think your best bet is to modify the guts.compressChunksAVX* assembly. All you need to do is set the FlagRoot flag on the final block. So, where the existing assembly calls ORL(Imm(2), ...), you want ORL(Imm(2|8), ...) instead. compressChunks produces chaining values, not arbitrary-length hashes -- but I'm pretty sure they're equivalent if you only need the first 32 bytes. So then you just need to cast the [MaxSIMD][8]uint32 to [MaxSIMD][32]byte and you should be good. I would test all this myself, but I don't have an AVX machine at the moment. 😅
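For the final cast, a minimal sketch of the idea (assuming little-endian word serialization, which is what BLAKE3 uses; MaxSIMD = 16 here is just a stand-in constant, and cvToBytes is a hypothetical helper, not the package's API):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const MaxSIMD = 16 // stand-in: 16 lanes of uint32 in an AVX-512 register

// cvToBytes serializes each 8-word chaining value into a 32-byte digest,
// little-endian, which is the word order BLAKE3 uses.
func cvToBytes(cvs *[MaxSIMD][8]uint32) (out [MaxSIMD][32]byte) {
	for i := range cvs {
		for j, w := range cvs[i] {
			binary.LittleEndian.PutUint32(out[i][j*4:], w)
		}
	}
	return
}

func main() {
	var cvs [MaxSIMD][8]uint32
	cvs[0][0] = 0x04030201
	out := cvToBytes(&cvs)
	fmt.Println(out[0][:4]) // [1 2 3 4]
}
```

In practice an unsafe pointer cast does the same thing with zero copies, since the in-memory layouts match on little-endian machines.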
Quick update...
So far I have implemented all the operations required by blake3 (addition, xor, and right rotation by 7, 8, 12, and 16 bits), to verify the right choice of AVX-512 instructions.
And I ran some benchmarks comparing those operations to pure-Go implementations.
The idea is that with AVX instructions I can compute the results for 16 uint32 numbers at once, while in Go I need to do it one by one.
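For reference, here is a scalar sketch of what one of the benchmarked lane-wise operations does (pure Go on math/bits; the function name is illustrative, not the one in the repo):

```go
package main

import (
	"fmt"
	"math/bits"
)

// rotateRight7 rotates each of 16 lanes right by 7 bits, one lane at a
// time: the scalar work a single AVX-512 instruction does in one shot.
func rotateRight7(v *[16]uint32) {
	for i := range v {
		v[i] = bits.RotateLeft32(v[i], -7) // negative count rotates right
	}
}

func main() {
	v := [16]uint32{0x80}
	rotateRight7(&v)
	fmt.Printf("%#x\n", v[0]) // 0x1
}
```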
Here are the results:
BenchmarkAddGo-20 14914677 82.13 ns/op
BenchmarkAddAVX-20 29459727 40.56 ns/op
BenchmarkXorGo-20 14659716 86.61 ns/op
BenchmarkXorAVX-20 29494845 40.71 ns/op
BenchmarkRotateRight7Go-20 31895943 38.07 ns/op
BenchmarkRotateRight7AVX-20 33658654 38.61 ns/op
BenchmarkRotateRight8Go-20 30425626 38.11 ns/op
BenchmarkRotateRight8AVX-20 34230165 38.12 ns/op
BenchmarkRotateRight12Go-20 30466242 38.12 ns/op
BenchmarkRotateRight12AVX-20 33703388 38.13 ns/op
BenchmarkRotateRight16Go-20 30780806 38.07 ns/op
BenchmarkRotateRight16AVX-20 31285239 35.26 ns/op
Interesting facts:
- For add and xor, AVX is only twice as fast, even though 16 uint32 values are processed at once.
- For right rotation the results are comparable, which is very interesting. Sadly, this operation requires executing 3 AVX instructions.
I think the power of AVX-512 instructions will show its potential once all the operations are combined in the full algorithm, rather than running them separately one by one. I guess the cost of moving data between memory and the registers is high.
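Per lane, the shift-based rotation that the three instructions implement looks like this (a sketch; the thread doesn't name the exact instructions, but the pattern is a shift right, a shift left, and an or):

```go
package main

import (
	"fmt"
	"math/bits"
)

// rotr emulates a right rotation with a logical shift right, a shift
// left, and an or: the same three-step pattern as the AVX sequence,
// shown here on a single lane.
func rotr(x uint32, n uint) uint32 {
	return (x >> n) | (x << (32 - n))
}

func main() {
	x := uint32(0xdeadbeef)
	fmt.Println(rotr(x, 12) == bits.RotateLeft32(x, -12)) // true
}
```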
Here you may check the code: https://github.com/outofforest/quantum/tree/51ce795a908a8addacf7d5ee529514097655f092/checksum
Indeed, a massive speedup is visible when 10 steps are combined inside the asm:
BenchmarkAddGo-20 16204963 81.70 ns/op
BenchmarkAddAVX-20 29706685 40.58 ns/op
BenchmarkAddAVX10-20 191069847 6.256 ns/op
BenchmarkXorGo-20 14851650 82.61 ns/op
BenchmarkXorAVX-20 29544925 40.70 ns/op
BenchmarkXorAVX10-20 190972988 6.261 ns/op
BenchmarkRotateRight7Go-20 30876564 38.37 ns/op
BenchmarkRotateRight7AVX-20 32420160 35.20 ns/op
BenchmarkRotateRight7AVX10-20 84815980 14.11 ns/op
BenchmarkRotateRight8Go-20 30728666 38.00 ns/op
BenchmarkRotateRight8AVX-20 33594930 35.36 ns/op
BenchmarkRotateRight8AVX10-20 83694783 14.07 ns/op
BenchmarkRotateRight12Go-20 30730410 38.25 ns/op
BenchmarkRotateRight12AVX-20 33705121 35.20 ns/op
BenchmarkRotateRight12AVX10-20 84024039 14.06 ns/op
BenchmarkRotateRight16Go-20 31537542 38.07 ns/op
BenchmarkRotateRight16AVX-20 33920329 35.25 ns/op
BenchmarkRotateRight16AVX10-20 84446478 14.07 ns/op
But it still seems that right rotation is a heavy operation.
Ohh... (not so) surprisingly, ChatGPT was wrong in saying that there is no direct right-rotation instruction for ZMM registers (there is: VPRORD).
By using it instead of a combination of right and left shifts, the results are much better:
BenchmarkAddGo-20 16072462 78.80 ns/op
BenchmarkAddAVX-20 29598748 39.96 ns/op
BenchmarkAddAVX10-20 173327589 6.914 ns/op
BenchmarkXorGo-20 15111134 79.43 ns/op
BenchmarkXorAVX-20 30229264 39.30 ns/op
BenchmarkXorAVX10-20 173679774 6.904 ns/op
BenchmarkRotateRight7Go-20 17917561 66.44 ns/op
BenchmarkRotateRight7AVX-20 34277546 35.07 ns/op
BenchmarkRotateRight7AVX10-20 178934812 6.694 ns/op
BenchmarkRotateRight8Go-20 21752845 64.95 ns/op
BenchmarkRotateRight8AVX-20 33594286 35.47 ns/op
BenchmarkRotateRight8AVX10-20 179084152 6.620 ns/op
BenchmarkRotateRight12Go-20 31215721 39.81 ns/op
BenchmarkRotateRight12AVX-20 37514030 32.24 ns/op
BenchmarkRotateRight12AVX10-20 190661989 6.287 ns/op
BenchmarkRotateRight16Go-20 31802187 38.60 ns/op
BenchmarkRotateRight16AVX-20 38044963 32.07 ns/op
BenchmarkRotateRight16AVX10-20 190768456 6.329 ns/op
Yeah, you're gonna encounter a lot of overhead calling into asm functions in a tight loop. It's always better to write the entire loop in asm if you can.
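A back-of-the-envelope model of why batching inside asm helps (the numbers here are hypothetical, chosen only to mirror the shape of the benchmarks above, not measured):

```go
package main

import "fmt"

// perUnit models the amortized cost per operation when a fixed call
// overhead c is spread over n operations costing w each. As n grows,
// the c/n term vanishes and only the real work w remains.
func perUnit(c, w float64, n int) float64 {
	return c/float64(n) + w
}

func main() {
	c, w := 30.0, 5.0 // hypothetical: 30ns call overhead, 5ns per op
	fmt.Println(perUnit(c, w, 1))  // 35
	fmt.Println(perUnit(c, w, 10)) // 8
}
```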
I implemented a blake3 version optimized for 1KB inputs and the results are AMAZING (!!!).
Here is the benchmark for computing 16 checksums of 1KB inputs:
BenchmarkChecksum1KLuke-20 30471 38537 ns/op
BenchmarkChecksum1KZeebo-20 51097 23478 ns/op
BenchmarkChecksum1KAVX-20 355615 3365 ns/op
Nice! If you're able to share the code, I'd be curious to take a look :)