Save Snappy's encode tmp table allocation
Closed this issue · 8 comments
It should be saved by using a static final FastThreadLocal, given that it is only used temporarily during encoding from the I/O event loop.
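A minimal sketch of the reuse idea, assuming the table is a short[] that has to be zeroed before each encode. It uses the JDK's java.lang.ThreadLocal as a stand-in so the snippet has no dependencies; in Netty the holder would be a FastThreadLocal, as suggested above. The class name ReusableHashTable, the acquire method and the MAX_TABLE_SIZE constant are hypothetical and not taken from the Snappy source.

```java
import java.util.Arrays;

// Hypothetical helper: keeps one scratch table per thread instead of allocating one per encode.
final class ReusableHashTable {
    // Assumed upper bound on the table size; the real limit lives in the Snappy encoder.
    static final int MAX_TABLE_SIZE = 1 << 14;

    // Stand-in for a static final FastThreadLocal<short[]>.
    private static final ThreadLocal<short[]> TABLE =
            ThreadLocal.withInitial(() -> new short[MAX_TABLE_SIZE]);

    private ReusableHashTable() { }

    // Returns the calling thread's table with the first `size` entries zeroed.
    // Callers must only touch indices in [0, size).
    static short[] acquire(int size) {
        if (size < 0 || size > MAX_TABLE_SIZE) {
            throw new IllegalArgumentException("size: " + size);
        }
        short[] table = TABLE.get();
        // Clear only the prefix that will be used, so stale entries are never read.
        Arrays.fill(table, 0, size, (short) 0);
        return table;
    }

    public static void main(String[] args) {
        short[] first = acquire(256);
        first[3] = 42;                        // simulate an encode writing into the table
        short[] second = acquire(256);
        System.out.println(first == second);  // same instance is handed out on the same thread
        System.out.println(second[3]);        // the entry was reset by acquire
    }
}
```

The trade-off is the one the benchmarks below try to measure: the allocation goes away, but every encode pays for an Arrays.fill over the used part of the table.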
I'm taking a look at #13226 which is related, I can work on this as well
Another option could be to allow users to specify an allocator and make use of a ByteBuf instead of a short[], but TBH this won't be super nice, because acquiring/releasing ByteBufs isn't free, nor is manipulating them (due to accessibility checks).
@franz1981 another option would be to just save it in the Snappy instance, if it's not too big, and reuse it.
Here are the results with the "normal" execution on my machine:
Benchmark (bufferSizeInBytes) (hashType) Mode Cnt Score Error Units
SnappyDirectBenchmark.encode 4096 NEW_ARRAY thrpt 3 593566.692 ± 76913.257 ops/s
SnappyDirectBenchmark.encode:compressedRatio 4096 NEW_ARRAY thrpt 3 1.999 ± 0.008 ops/s
SnappyDirectBenchmark.encode 4096 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 632241.984 ± 31584.847 ops/s
SnappyDirectBenchmark.encode:compressedRatio 4096 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 1.999 ± 0.001 ops/s
SnappyDirectBenchmark.encode 2048 NEW_ARRAY thrpt 3 1105703.665 ± 180889.589 ops/s
SnappyDirectBenchmark.encode:compressedRatio 2048 NEW_ARRAY thrpt 3 2.000 ± 0.009 ops/s
SnappyDirectBenchmark.encode 2048 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 1240283.412 ± 23526.154 ops/s
SnappyDirectBenchmark.encode:compressedRatio 2048 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 1.999 ± 0.004 ops/s
SnappyDirectBenchmark.encode 1024 NEW_ARRAY thrpt 3 1891780.721 ± 201614.887 ops/s
SnappyDirectBenchmark.encode:compressedRatio 1024 NEW_ARRAY thrpt 3 1.899 ± 0.007 ops/s
SnappyDirectBenchmark.encode 1024 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 2404584.135 ± 95665.279 ops/s
SnappyDirectBenchmark.encode:compressedRatio 1024 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 1.899 ± 0.009 ops/s
SnappyDirectBenchmark.encode 512 NEW_ARRAY thrpt 3 2951070.746 ± 180055.758 ops/s
SnappyDirectBenchmark.encode:compressedRatio 512 NEW_ARRAY thrpt 3 1.799 ± 0.007 ops/s
SnappyDirectBenchmark.encode 512 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 4700612.902 ± 104919.274 ops/s
SnappyDirectBenchmark.encode:compressedRatio 512 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 1.799 ± 0.001 ops/s
SnappyDirectBenchmark.encode 256 NEW_ARRAY thrpt 3 8348209.803 ± 587891.640 ops/s
SnappyDirectBenchmark.encode:compressedRatio 256 NEW_ARRAY thrpt 3 1.599 ± 0.008 ops/s
SnappyDirectBenchmark.encode 256 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 8717241.738 ± 2841145.484 ops/s
SnappyDirectBenchmark.encode:compressedRatio 256 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 1.600 ± 0.007 ops/s
SnappyDirectBenchmark.encode 128 NEW_ARRAY thrpt 3 15383331.321 ± 117478.564 ops/s
SnappyDirectBenchmark.encode:compressedRatio 128 NEW_ARRAY thrpt 3 1.199 ± 0.001 ops/s
SnappyDirectBenchmark.encode 128 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 15137715.154 ± 128345.567 ops/s
SnappyDirectBenchmark.encode:compressedRatio 128 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 1.199 ± 0.001 ops/s
And using FastThreads
Benchmark (bufferSizeInBytes) (hashType) Mode Cnt Score Error Units
SnappyDirectBenchmark.encode 4096 NEW_ARRAY thrpt 3 566358.838 ± 7532.810 ops/s
SnappyDirectBenchmark.encode:compressedRatio 4096 NEW_ARRAY thrpt 3 1.999 ± 0.010 ops/s
SnappyDirectBenchmark.encode 4096 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 611757.306 ± 1450.606 ops/s
SnappyDirectBenchmark.encode:compressedRatio 4096 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 1.999 ± 0.010 ops/s
SnappyDirectBenchmark.encode 2048 NEW_ARRAY thrpt 3 1070361.595 ± 46964.258 ops/s
SnappyDirectBenchmark.encode:compressedRatio 2048 NEW_ARRAY thrpt 3 1.999 ± 0.009 ops/s
SnappyDirectBenchmark.encode 2048 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 1195206.203 ± 62937.510 ops/s
SnappyDirectBenchmark.encode:compressedRatio 2048 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 1.999 ± 0.011 ops/s
SnappyDirectBenchmark.encode 1024 NEW_ARRAY thrpt 3 1792458.611 ± 155324.629 ops/s
SnappyDirectBenchmark.encode:compressedRatio 1024 NEW_ARRAY thrpt 3 1.899 ± 0.002 ops/s
SnappyDirectBenchmark.encode 1024 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 2287621.703 ± 215238.310 ops/s
SnappyDirectBenchmark.encode:compressedRatio 1024 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 1.899 ± 0.004 ops/s
SnappyDirectBenchmark.encode 512 NEW_ARRAY thrpt 3 3039307.093 ± 54345.718 ops/s
SnappyDirectBenchmark.encode:compressedRatio 512 NEW_ARRAY thrpt 3 1.799 ± 0.001 ops/s
SnappyDirectBenchmark.encode 512 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 4321904.402 ± 295391.220 ops/s
SnappyDirectBenchmark.encode:compressedRatio 512 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 1.799 ± 0.007 ops/s
SnappyDirectBenchmark.encode 256 NEW_ARRAY thrpt 3 7610669.300 ± 983826.483 ops/s
SnappyDirectBenchmark.encode:compressedRatio 256 NEW_ARRAY thrpt 3 1.599 ± 0.001 ops/s
SnappyDirectBenchmark.encode 256 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 6392198.197 ± 14266003.360 ops/s
SnappyDirectBenchmark.encode:compressedRatio 256 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 1.599 ± 0.001 ops/s
SnappyDirectBenchmark.encode 128 NEW_ARRAY thrpt 3 14527503.788 ± 118455.934 ops/s
SnappyDirectBenchmark.encode:compressedRatio 128 NEW_ARRAY thrpt 3 1.199 ± 0.001 ops/s
SnappyDirectBenchmark.encode 128 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 14551592.183 ± 158079.530 ops/s
SnappyDirectBenchmark.encode:compressedRatio 128 FAST_THREAD_LOCAL_ARRAY_FILL thrpt 3 1.199 ± 0.001 ops/s
I thought of using a system property because there seems to be a slight regression (within the error margin) when bufferSizeInBytes = 256. I need to investigate it further, but I thought it was a good idea to open a PR to trigger the discussion about the implementation. In general it seems to be a good improvement, but it's safer to put it behind a system property so that users can test it first.
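As a sketch of the opt-in described above, the choice between allocating a fresh table and reusing a cached one could be driven by a system property. The property name example.snappy.reuseHashTable and the HashTableStrategy class are hypothetical, chosen only for illustration; the PR may use a different name, default and structure.

```java
import java.util.Arrays;

// Hypothetical toggle: picks the table strategy from a system property, defaulting to the old behavior.
final class HashTableStrategy {
    // Assumed property name, not taken from the PR.
    static final String PROPERTY = "example.snappy.reuseHashTable";

    // JDK ThreadLocal used as a dependency-free stand-in for FastThreadLocal.
    private static final ThreadLocal<short[]> CACHED = new ThreadLocal<>();

    private HashTableStrategy() { }

    // Read on every call so the flag can be flipped in a test;
    // a real encoder would more likely read it once into a static final field.
    static boolean reuseEnabled() {
        return Boolean.parseBoolean(System.getProperty(PROPERTY, "false"));
    }

    static short[] table(int size) {
        if (!reuseEnabled()) {
            return new short[size];            // original behavior: fresh, zeroed array per encode
        }
        short[] cached = CACHED.get();
        if (cached == null || cached.length < size) {
            cached = new short[size];          // grow lazily; a new array is already zeroed
            CACHED.set(cached);
        } else {
            Arrays.fill(cached, 0, size, (short) 0); // reset only the part that will be used
        }
        return cached;
    }
}
```

With this shape, users could opt in by starting the JVM with -Dexample.snappy.reuseHashTable=true and compare both modes under their own workload.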
Please @lucamolteni can you share something using https://jmh.morethan.io/ charts to compare the 2 runs?
It would help to visualize what's going on and where 🙏
Additionally, I would just report the fast thread local ones, given that this is the most common use case we optimize for, which would reduce the numbers to look at.
> I thought of using a system property because there seems to be a slight regression (between the error margin) when bufferSizeInBytes = 256. I need to investigate it further
I suggest looking at https://www.opsian.com/blog/jvms-allocateprefetch-options/ which shows that 4 (AllocatePrefetchLines) * 64 bytes (AllocatePrefetchStepSize) = 256 bytes are prefetched, and this would mostly impact the table allocation (if no thread local is used).
NOTE: search for AllocatePrefetch at https://chriswhocodes.com/oracle_jdk17_options.html as well.
You can use both -prof perfasm and -prof perfnorm, and compare cache misses by disabling prefetching (-XX:AllocatePrefetchStyle=0) or by playing with AllocatePrefetchLines to extend it beyond 4 cache lines (or reduce it).
With x86 the math is easy, with ARM I have no idea...
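The x86 arithmetic above can be written out as a small sketch. The values 4 and 64 are the ones quoted in the comment; the 16-byte array header is an assumption about a typical 64-bit HotSpot layout and may differ with other JVMs or flags. The class and method names are illustrative only.

```java
// Rough estimate of whether a short[] of a given length fits in the allocation prefetch window.
final class PrefetchWindow {
    static final int PREFETCH_LINES = 4;               // AllocatePrefetchLines, as quoted above
    static final int PREFETCH_STEP_BYTES = 64;         // AllocatePrefetchStepSize, as quoted above
    static final int ASSUMED_ARRAY_HEADER_BYTES = 16;  // assumption, not measured

    private PrefetchWindow() { }

    // 4 lines * 64 bytes = 256 bytes
    static int windowBytes() {
        return PREFETCH_LINES * PREFETCH_STEP_BYTES;
    }

    // Estimated footprint: assumed header plus 2 bytes per short entry.
    static long shortArrayBytes(int entries) {
        return ASSUMED_ARRAY_HEADER_BYTES + (long) entries * Short.BYTES;
    }

    static boolean fitsInWindow(int entries) {
        return shortArrayBytes(entries) <= windowBytes();
    }

    public static void main(String[] args) {
        for (int entries : new int[] {64, 128, 256}) {
            System.out.println(entries + " entries -> " + shortArrayBytes(entries)
                    + " bytes, fits in window: " + fitsInWindow(entries));
        }
    }
}
```

This is only back-of-the-envelope reasoning; the perfnorm counters mentioned above are what would actually confirm or refute it.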
This time the regression was on 128, but I'm not using FastThreads here
What would the table size have been? Consider that a short is 2 bytes, which means that anything between 0 and 256 bytes benefits from prefetching on x86.
With bufferSizeInBytes=128 the hash table size will be 128.