netty/netty

Save Snappy's encode tmp table allocation

Closed this issue · 8 comments

Snappy's encode is allocating a fresh hash table in the hot path, see:

public void encode(final ByteBuf in, final ByteBuf out, final int length) {
    // Write the preamble length to the output buffer
    for (int i = 0;; i ++) {
        int b = length >>> i * 7;
        if ((b & 0xFFFFFF80) != 0) {
            out.writeByte(b & 0x7f | 0x80);
        } else {
            out.writeByte(b);
            break;
        }
    }

    int inIndex = in.readerIndex();
    final int baseIndex = inIndex;
    final short[] table = getHashTable(length);

The allocation could be avoided by using a static final FastThreadLocal, given that the table is only used temporarily during encoding on the I/O event loop.
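A minimal sketch of that idea (class and method names here are hypothetical, not the actual patch): keep the table in a static final FastThreadLocal and clear only the prefix the encoder will touch, instead of allocating a fresh array per call.

import java.util.Arrays;

import io.netty.util.concurrent.FastThreadLocal;

final class SnappyEncoderTables {

    // Assumed upper bound: Snappy caps its hash table size, so one maximum-sized
    // per-thread array can serve every encode call.
    private static final int MAX_HT_SIZE = 1 << 14;

    // One reusable table per (event-loop) thread; allocated lazily on first use.
    private static final FastThreadLocal<short[]> HASH_TABLE = new FastThreadLocal<short[]>() {
        @Override
        protected short[] initialValue() {
            return new short[MAX_HT_SIZE];
        }
    };

    private SnappyEncoderTables() {
    }

    // Hypothetical replacement for getHashTable(length): reuse the thread-local
    // array and zero only the entries the encoder will actually read.
    static short[] hashTable(int tableSize) {
        short[] table = HASH_TABLE.get();
        Arrays.fill(table, 0, tableSize, (short) 0);
        return table;
    }
}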

I'm taking a look at #13226, which is related; I can work on this as well.

Another option could be to allow users to specify an allocator and use a ByteBuf instead of a short[], but TBH this wouldn't be super nice, because acquiring/releasing ByteBufs isn't free, nor is manipulating them (due to accessibility checks).

@franz1981 another option would be to just save the table in the Snappy instance, if it's not too big, and reuse it.
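A rough sketch of that alternative, again with made-up names: lazily grow a short[] field on the Snappy instance and reuse it across encode calls, falling back to a plain allocation when the requested table is too large to be worth caching.

import java.util.Arrays;

// Sketch only: caching the table on the encoder instance itself, so no
// thread-local lookup is needed (a Snappy instance is already per-encoder).
final class InstanceCachedTable {

    // Assumed cap so the instance doesn't pin a large array forever.
    private static final int MAX_CACHED_HT_SIZE = 1 << 14;

    private short[] cachedTable;

    short[] hashTable(int tableSize) {
        if (tableSize > MAX_CACHED_HT_SIZE) {
            // Too big to keep around: allocate as before.
            return new short[tableSize];
        }
        short[] table = cachedTable;
        if (table == null || table.length < tableSize) {
            // First use, or the cached table is too small: (re)allocate and keep it.
            table = new short[tableSize];
            cachedTable = table;
        } else {
            // A reused table must be zeroed before the encoder reads from it.
            Arrays.fill(table, 0, tableSize, (short) 0);
        }
        return table;
    }
}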

Here are the results with the "normal" execution on my machine:

Benchmark                                         (bufferSizeInBytes)                    (hashType)   Mode  Cnt         Score         Error   Units
SnappyDirectBenchmark.encode                                     4096                     NEW_ARRAY  thrpt    3    593566.692 ±   76913.257   ops/s
SnappyDirectBenchmark.encode:compressedRatio                     4096                     NEW_ARRAY  thrpt    3         1.999 ±       0.008   ops/s
SnappyDirectBenchmark.encode                                     4096  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3    632241.984 ±   31584.847   ops/s
SnappyDirectBenchmark.encode:compressedRatio                     4096  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3         1.999 ±       0.001   ops/s
SnappyDirectBenchmark.encode                                     2048                     NEW_ARRAY  thrpt    3   1105703.665 ±  180889.589   ops/s
SnappyDirectBenchmark.encode:compressedRatio                     2048                     NEW_ARRAY  thrpt    3         2.000 ±       0.009   ops/s
SnappyDirectBenchmark.encode                                     2048  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3   1240283.412 ±   23526.154   ops/s
SnappyDirectBenchmark.encode:compressedRatio                     2048  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3         1.999 ±       0.004   ops/s
SnappyDirectBenchmark.encode                                     1024                     NEW_ARRAY  thrpt    3   1891780.721 ±  201614.887   ops/s
SnappyDirectBenchmark.encode:compressedRatio                     1024                     NEW_ARRAY  thrpt    3         1.899 ±       0.007   ops/s
SnappyDirectBenchmark.encode                                     1024  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3   2404584.135 ±   95665.279   ops/s
SnappyDirectBenchmark.encode:compressedRatio                     1024  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3         1.899 ±       0.009   ops/s
SnappyDirectBenchmark.encode                                      512                     NEW_ARRAY  thrpt    3   2951070.746 ±  180055.758   ops/s
SnappyDirectBenchmark.encode:compressedRatio                      512                     NEW_ARRAY  thrpt    3         1.799 ±       0.007   ops/s
SnappyDirectBenchmark.encode                                      512  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3   4700612.902 ±  104919.274   ops/s
SnappyDirectBenchmark.encode:compressedRatio                      512  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3         1.799 ±       0.001   ops/s
SnappyDirectBenchmark.encode                                      256                     NEW_ARRAY  thrpt    3   8348209.803 ±  587891.640   ops/s
SnappyDirectBenchmark.encode:compressedRatio                      256                     NEW_ARRAY  thrpt    3         1.599 ±       0.008   ops/s
SnappyDirectBenchmark.encode                                      256  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3   8717241.738 ± 2841145.484   ops/s
SnappyDirectBenchmark.encode:compressedRatio                      256  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3         1.600 ±       0.007   ops/s
SnappyDirectBenchmark.encode                                      128                     NEW_ARRAY  thrpt    3  15383331.321 ±  117478.564   ops/s
SnappyDirectBenchmark.encode:compressedRatio                      128                     NEW_ARRAY  thrpt    3         1.199 ±       0.001   ops/s
SnappyDirectBenchmark.encode                                      128  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3  15137715.154 ±  128345.567   ops/s
SnappyDirectBenchmark.encode:compressedRatio                      128  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3         1.199 ±       0.001   ops/s

And here are the results using FastThreads:

Benchmark                                     (bufferSizeInBytes)                    (hashType)   Mode  Cnt         Score          Error  Units
SnappyDirectBenchmark.encode                                 4096                     NEW_ARRAY  thrpt    3    566358.838 ±     7532.810  ops/s
SnappyDirectBenchmark.encode:compressedRatio                 4096                     NEW_ARRAY  thrpt    3         1.999 ±        0.010  ops/s
SnappyDirectBenchmark.encode                                 4096  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3    611757.306 ±     1450.606  ops/s
SnappyDirectBenchmark.encode:compressedRatio                 4096  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3         1.999 ±        0.010  ops/s
SnappyDirectBenchmark.encode                                 2048                     NEW_ARRAY  thrpt    3   1070361.595 ±    46964.258  ops/s
SnappyDirectBenchmark.encode:compressedRatio                 2048                     NEW_ARRAY  thrpt    3         1.999 ±        0.009  ops/s
SnappyDirectBenchmark.encode                                 2048  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3   1195206.203 ±    62937.510  ops/s
SnappyDirectBenchmark.encode:compressedRatio                 2048  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3         1.999 ±        0.011  ops/s
SnappyDirectBenchmark.encode                                 1024                     NEW_ARRAY  thrpt    3   1792458.611 ±   155324.629  ops/s
SnappyDirectBenchmark.encode:compressedRatio                 1024                     NEW_ARRAY  thrpt    3         1.899 ±        0.002  ops/s
SnappyDirectBenchmark.encode                                 1024  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3   2287621.703 ±   215238.310  ops/s
SnappyDirectBenchmark.encode:compressedRatio                 1024  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3         1.899 ±        0.004  ops/s
SnappyDirectBenchmark.encode                                  512                     NEW_ARRAY  thrpt    3   3039307.093 ±    54345.718  ops/s
SnappyDirectBenchmark.encode:compressedRatio                  512                     NEW_ARRAY  thrpt    3         1.799 ±        0.001  ops/s
SnappyDirectBenchmark.encode                                  512  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3   4321904.402 ±   295391.220  ops/s
SnappyDirectBenchmark.encode:compressedRatio                  512  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3         1.799 ±        0.007  ops/s
SnappyDirectBenchmark.encode                                  256                     NEW_ARRAY  thrpt    3   7610669.300 ±   983826.483  ops/s
SnappyDirectBenchmark.encode:compressedRatio                  256                     NEW_ARRAY  thrpt    3         1.599 ±        0.001  ops/s
SnappyDirectBenchmark.encode                                  256  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3   6392198.197 ± 14266003.360  ops/s
SnappyDirectBenchmark.encode:compressedRatio                  256  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3         1.599 ±        0.001  ops/s
SnappyDirectBenchmark.encode                                  128                     NEW_ARRAY  thrpt    3  14527503.788 ±   118455.934  ops/s
SnappyDirectBenchmark.encode:compressedRatio                  128                     NEW_ARRAY  thrpt    3         1.199 ±        0.001  ops/s
SnappyDirectBenchmark.encode                                  128  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3  14551592.183 ±   158079.530  ops/s
SnappyDirectBenchmark.encode:compressedRatio                  128  FAST_THREAD_LOCAL_ARRAY_FILL  thrpt    3         1.199 ±        0.001  ops/s

I thought of using a system property because there seems to be a slight regression (within the error margin) when bufferSizeInBytes = 256. I need to investigate it further, but I thought it was a good idea to open a PR to trigger the discussion about the implementation. In general it seems to be a good improvement, but it's safer to put it behind a system property so that users can test it first.
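For the system-property part, something along these lines could work, using Netty's SystemPropertyUtil (the property name below is invented for illustration; the default keeps the old behaviour):

import io.netty.util.internal.SystemPropertyUtil;

final class SnappyTableReuseFlag {

    // Hypothetical property name; false by default, i.e. keep allocating a fresh table.
    private static final boolean REUSE_HASH_TABLE = SystemPropertyUtil.getBoolean(
            "io.netty.handler.codec.compression.snappy.reuseHashTable", false);

    private SnappyTableReuseFlag() {
    }

    static boolean reuseHashTable() {
        return REUSE_HASH_TABLE;
    }
}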

@lucamolteni can you please share something using https://jmh.morethan.io/ charts to compare the two runs?
It would help to visualize what's going on and where 🙏

Additionally, I would report just the fast-thread-local ones, given that's the most common use case we optimize for; it would also reduce the number of results to look at.

I thought of using a system property because there seems to be a slight regression (within the error margin) when bufferSizeInBytes = 256. I need to investigate it further

I suggest looking at https://www.opsian.com/blog/jvms-allocateprefetch-options/,
which shows that 4 (AllocatePrefetchLines) * 64 bytes (AllocatePrefetchStepSize) = 256 bytes are prefetched, and this would mostly impact the table allocation (if no thread local is used).

NOTE: search for AllocatePrefetch at https://chriswhocodes.com/oracle_jdk17_options.html as well


You can use both -prof perfasm and -prof perfnorm and compare cache misses by disabling prefetching (-XX:AllocatePrefetchStyle=0) or by playing with AllocatePrefetchLines to extend prefetching beyond 4 cache lines (or reduce it).

With x86 the math is easy; with ARM I have no idea...
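Not the exact command line, but roughly how that comparison could be driven from JMH's Java API (the benchmark selector is taken from the results above; the profilers and JVM flag are the ones discussed here):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public final class SnappyPrefetchComparison {

    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                .include("SnappyDirectBenchmark.encode")
                .addProfiler("perfasm")
                .addProfiler("perfnorm")
                // Run once with this flag (prefetching disabled) and once without it,
                // or vary -XX:AllocatePrefetchLines instead, then compare cache misses.
                .jvmArgsAppend("-XX:AllocatePrefetchStyle=0")
                .build();
        new Runner(opts).run();
    }
}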

[screenshot of the benchmark comparison chart]

This time the regression was on 128, but I'm not using FastThreads here.

This time the regression was on 128, but I'm not using FastThreads here

Which table size would that have been? Consider that a short is 2 bytes,
which means that anything between 0 and 256 bytes benefits from prefetching, on x86.

This time the regression was on 128, but I'm not using FastThreads here

Which table size would that have been? Consider that a short is 2 bytes, which means that anything between 0 and 256 bytes benefits from prefetching, on x86.

With bufferSizeInBytes=128 the hash table size will be 128.
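If so, that is 128 entries × 2 bytes = 256 bytes of table, which would sit exactly within the default 4 × 64-byte allocation-prefetch window on x86 discussed above.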