AVX for 1KB input
This sentence caught my interest as I need to work with exactly 1KB input :-):
There is no assembly routine for single-block compressions.
Is there any specific reason?
Basically: you can use SIMD for a hash function in two ways. Traditionally, you use it to parallelize the operations within a block. BLAKE3 is different: because it's a tree hash, you can instead parallelize across leaves of the tree. The latter is much more "embarrassingly parallel" than the former: it scales directly with the number of SIMD registers, and it's simpler to implement.
Furthermore, for small messages, you start having to worry about the overhead of warming up "cold" SIMD registers, and (in Go specifically) the overhead of calling into assembly instead of a potentially-inlineable function. So you get much more bang for your buck by parallelizing across leaves. It is certainly possible to parallelize within a block, and it would definitely speed things up for certain workloads; but it's unclear how much faster it would be, and that makes it hard for me to justify spending too much time and energy on it.
If you're curious, here's what the asm for intra-block SIMD looks like in (slightly modified) BLAKE2b: https://github.com/lukechampine/us/blob/master/merkle/blake2b/gen.go
Thanks for the valuable input. The 1KB case is so critical to me (I need to compute millions of these hashes all the time) that I might decide to invest my time in doing it, despite having zero experience in SIMD programming (but it might be yet another cool thing to learn). Could you clarify a couple of things?
My scenario is that I have thousands of 1KB chunks spread across the memory of my program (it's a merkle tree), and I need to compute a separate blake3 checksum for each of them. As you mentioned, I see two possibilities:
a. I may use AVX to parallelize the blocks within a single chunk.
b. I may use AVX to compute hashes for many chunks at a time.
Both approaches are fine for me; obviously I prefer the one that allows me to compute more hashes per second.
You said that AVX is better used to process 1KB chunks rather than 64-byte blocks. Are there any issues to consider with this approach?
As I understand it, by using AVX-512 instructions I could process up to 16 chunks in a single shot?
You said that some init step is required. Since I compute these hashes all the time, may I reserve the registers and initialize them once for the entire lifetime of my app?
I encourage you to try it! Writing asm is much easier in Go nowadays thanks to Avo. I used Felix Cloutier's instruction reference to help understand what each of the SIMD instructions does.
I think your best bet is to modify the guts.compressChunksAVX* assembly. All you need to do is set the FlagRoot flag on the final block. So, where the existing assembly calls ORL(Imm(2), ...), you want ORL(Imm(2|8), ...) instead. compressChunks produces chaining values, not arbitrary-length hashes -- but I'm pretty sure they're equivalent if you only need the first 32 bytes. So then you just need to cast the [MaxSIMD][8]uint32 to [MaxSIMD][32]byte and you should be good. I would test all this myself, but I don't have an AVX machine at the moment. 😅
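For the final cast, a minimal sketch of the idea (assuming little-endian word serialization, which is what BLAKE3 uses; MaxSIMD = 16 here is just a stand-in constant, and cvToBytes is a hypothetical helper, not the package's API):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const MaxSIMD = 16 // stand-in: 16 lanes of uint32 in an AVX-512 register

// cvToBytes serializes each 8-word chaining value into a 32-byte digest,
// little-endian, which is the word order BLAKE3 uses.
func cvToBytes(cvs *[MaxSIMD][8]uint32) (out [MaxSIMD][32]byte) {
	for i := range cvs {
		for j, w := range cvs[i] {
			binary.LittleEndian.PutUint32(out[i][j*4:], w)
		}
	}
	return
}

func main() {
	var cvs [MaxSIMD][8]uint32
	cvs[0][0] = 0x04030201
	out := cvToBytes(&cvs)
	fmt.Println(out[0][:4]) // [1 2 3 4]
}
```

In practice an unsafe pointer cast does the same thing with zero copies, since the in-memory layouts match on little-endian machines.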
Quick update...
So far I have implemented all the operations required by blake3 (addition, xor, and right rotation by 7, 8, 12, and 16 bits), to verify the right choice of AVX-512 instructions.
And I ran some benchmarks comparing those operations to pure-Go implementations.
The idea is that with AVX instructions I can compute the results for 16 uint32 numbers at once, while in Go I need to do it one by one.
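For reference, here is a scalar sketch of what one of the benchmarked lane-wise operations does (pure Go on math/bits; the function name is illustrative, not the one in the repo):

```go
package main

import (
	"fmt"
	"math/bits"
)

// rotateRight7 rotates each of 16 lanes right by 7 bits, one lane at a
// time: the scalar work a single AVX-512 instruction does in one shot.
func rotateRight7(v *[16]uint32) {
	for i := range v {
		v[i] = bits.RotateLeft32(v[i], -7) // negative count rotates right
	}
}

func main() {
	v := [16]uint32{0x80}
	rotateRight7(&v)
	fmt.Printf("%#x\n", v[0]) // 0x1
}
```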
Here are the results:
BenchmarkAddGo-20 14914677 82.13 ns/op
BenchmarkAddAVX-20 29459727 40.56 ns/op
BenchmarkXorGo-20 14659716 86.61 ns/op
BenchmarkXorAVX-20 29494845 40.71 ns/op
BenchmarkRotateRight7Go-20 31895943 38.07 ns/op
BenchmarkRotateRight7AVX-20 33658654 38.61 ns/op
BenchmarkRotateRight8Go-20 30425626 38.11 ns/op
BenchmarkRotateRight8AVX-20 34230165 38.12 ns/op
BenchmarkRotateRight12Go-20 30466242 38.12 ns/op
BenchmarkRotateRight12AVX-20 33703388 38.13 ns/op
BenchmarkRotateRight16Go-20 30780806 38.07 ns/op
BenchmarkRotateRight16AVX-20 31285239 35.26 ns/op
Interesting facts:
- For add and xor, AVX is only twice as fast, even though 16 uint32 values are processed at once.
- For right rotation the results are comparable, which is very interesting. Sadly, this operation requires executing 3 AVX instructions.
I think the power of AVX-512 instructions will show its potential once all the operations are combined in the full algorithm, rather than running them separately one by one. I guess the cost of moving data between memory and the registers is high.
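Per lane, the shift-based rotation that the three instructions implement looks like this (a sketch; the thread doesn't name the exact instructions, but the pattern is a shift right, a shift left, and an or):

```go
package main

import (
	"fmt"
	"math/bits"
)

// rotr emulates a right rotation with a logical shift right, a shift
// left, and an or: the same three-step pattern as the AVX sequence,
// shown here on a single lane.
func rotr(x uint32, n uint) uint32 {
	return (x >> n) | (x << (32 - n))
}

func main() {
	x := uint32(0xdeadbeef)
	fmt.Println(rotr(x, 12) == bits.RotateLeft32(x, -12)) // true
}
```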
Here you may check the code: https://github.com/outofforest/quantum/tree/51ce795a908a8addacf7d5ee529514097655f092/checksum
Indeed, a massive speedup is visible when 10 steps are combined inside the asm:
BenchmarkAddGo-20 16204963 81.70 ns/op
BenchmarkAddAVX-20 29706685 40.58 ns/op
BenchmarkAddAVX10-20 191069847 6.256 ns/op
BenchmarkXorGo-20 14851650 82.61 ns/op
BenchmarkXorAVX-20 29544925 40.70 ns/op
BenchmarkXorAVX10-20 190972988 6.261 ns/op
BenchmarkRotateRight7Go-20 30876564 38.37 ns/op
BenchmarkRotateRight7AVX-20 32420160 35.20 ns/op
BenchmarkRotateRight7AVX10-20 84815980 14.11 ns/op
BenchmarkRotateRight8Go-20 30728666 38.00 ns/op
BenchmarkRotateRight8AVX-20 33594930 35.36 ns/op
BenchmarkRotateRight8AVX10-20 83694783 14.07 ns/op
BenchmarkRotateRight12Go-20 30730410 38.25 ns/op
BenchmarkRotateRight12AVX-20 33705121 35.20 ns/op
BenchmarkRotateRight12AVX10-20 84024039 14.06 ns/op
BenchmarkRotateRight16Go-20 31537542 38.07 ns/op
BenchmarkRotateRight16AVX-20 33920329 35.25 ns/op
BenchmarkRotateRight16AVX10-20 84446478 14.07 ns/op
But it still seems that right rotation is a heavy operation.
Ohh... (not so) surprisingly, ChatGPT was wrong in saying that there is no direct right-rotation instruction for ZMM registers (there is: VPRORD).
By using it instead of a combination of right and left shifts, the results are much better:
BenchmarkAddGo-20 16072462 78.80 ns/op
BenchmarkAddAVX-20 29598748 39.96 ns/op
BenchmarkAddAVX10-20 173327589 6.914 ns/op
BenchmarkXorGo-20 15111134 79.43 ns/op
BenchmarkXorAVX-20 30229264 39.30 ns/op
BenchmarkXorAVX10-20 173679774 6.904 ns/op
BenchmarkRotateRight7Go-20 17917561 66.44 ns/op
BenchmarkRotateRight7AVX-20 34277546 35.07 ns/op
BenchmarkRotateRight7AVX10-20 178934812 6.694 ns/op
BenchmarkRotateRight8Go-20 21752845 64.95 ns/op
BenchmarkRotateRight8AVX-20 33594286 35.47 ns/op
BenchmarkRotateRight8AVX10-20 179084152 6.620 ns/op
BenchmarkRotateRight12Go-20 31215721 39.81 ns/op
BenchmarkRotateRight12AVX-20 37514030 32.24 ns/op
BenchmarkRotateRight12AVX10-20 190661989 6.287 ns/op
BenchmarkRotateRight16Go-20 31802187 38.60 ns/op
BenchmarkRotateRight16AVX-20 38044963 32.07 ns/op
BenchmarkRotateRight16AVX10-20 190768456 6.329 ns/op
Yeah, you're gonna encounter a lot of overhead calling into asm functions in a tight loop. It's always better to write the entire loop in asm if you can.
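A back-of-the-envelope model of why batching inside asm helps (the numbers here are hypothetical, chosen only to mirror the shape of the benchmarks above, not measured):

```go
package main

import "fmt"

// perUnit models the amortized cost per operation when a fixed call
// overhead c is spread over n operations costing w each. As n grows,
// the c/n term vanishes and only the real work w remains.
func perUnit(c, w float64, n int) float64 {
	return c/float64(n) + w
}

func main() {
	c, w := 30.0, 5.0 // hypothetical: 30ns call overhead, 5ns per op
	fmt.Println(perUnit(c, w, 1))  // 35
	fmt.Println(perUnit(c, w, 10)) // 8
}
```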
I implemented a blake3 version optimized for 1KB inputs and the results are AMAZING (!!!).
Here is the benchmark for computing 16 checksums of 1KB inputs:
BenchmarkChecksum1KLuke-20 30471 38537 ns/op
BenchmarkChecksum1KZeebo-20 51097 23478 ns/op
BenchmarkChecksum1KAVX-20 355615 3365 ns/op
Nice! If you're able to share the code, I'd be curious to take a look :)