facebook/zstd

Decompression speed drops when using a shared dictionary on short documents

pkolaczk opened this issue · 4 comments

Describe the bug
When I use a shared dictionary, the decompression speed drops significantly, to a level well below what I get without the dictionary, when compressing/decompressing small chunks of a few kB each. The difference is bigger at stronger compression levels. Compression speed, however, looks unaffected. I'm not sure whether this is a bug, but it's quite disappointing considering that your website states that dictionaries should improve performance, especially for small messages.

I triple-checked that I initialize the DecoderDictionary only once and reuse it for all blocks.
(See the attached profile at the end, which confirms that 95%+ of the time is spent inside ZSTD decompression.)

To Reproduce
Steps to reproduce the behavior:

  1. Download a large enough CSV file, e.g. one of those is fine: https://www.ncei.noaa.gov/data/global-hourly/archive/csv/. I downloaded the pack for 1940.
  2. Unpack it with your favorite tar.gz unpacker.
  3. Install compresto by running:
    cargo install compresto
  4. Run a series of benchmarks without the dictionary:
% compresto benchmark-many /Users/piotr/Downloads/1940/72511514711.csv -a zstd -b 16384                           
zstd -c -7: 2296693 => 450566 (19.6 %), compression: 443.1 MB/s, decompression: 1196.4 MB/s
zstd -c -6: 2296693 => 421861 (18.4 %), compression: 649.8 MB/s, decompression: 2459.0 MB/s
zstd -c -5: 2296693 => 404779 (17.6 %), compression: 959.6 MB/s, decompression: 2430.7 MB/s
zstd -c -4: 2296693 => 373980 (16.3 %), compression: 931.8 MB/s, decompression: 1692.2 MB/s
zstd -c -3: 2296693 => 327156 (14.2 %), compression: 650.8 MB/s, decompression: 1736.6 MB/s
zstd -c -2: 2296693 => 282709 (12.3 %), compression: 683.7 MB/s, decompression: 1765.6 MB/s
zstd -c -1: 2296693 => 253982 (11.1 %), compression: 688.5 MB/s, decompression: 1851.4 MB/s
zstd -c 1: 2296693 => 194106 (8.5 %), compression: 602.2 MB/s, decompression: 1434.2 MB/s
zstd -c 2: 2296693 => 193427 (8.4 %), compression: 367.8 MB/s, decompression: 1636.6 MB/s
zstd -c 3: 2296693 => 206364 (9.0 %), compression: 231.9 MB/s, decompression: 1636.6 MB/s
zstd -c 4: 2296693 => 206359 (9.0 %), compression: 132.7 MB/s, decompression: 1762.2 MB/s
zstd -c 5: 2296693 => 195890 (8.5 %), compression: 85.5 MB/s, decompression: 1783.8 MB/s
zstd -c 6: 2296693 => 187577 (8.2 %), compression: 83.3 MB/s, decompression: 2185.3 MB/s
zstd -c 7: 2296693 => 186520 (8.1 %), compression: 50.6 MB/s, decompression: 2136.0 MB/s
zstd -c 8: 2296693 => 175292 (7.6 %), compression: 49.5 MB/s, decompression: 2140.4 MB/s
zstd -c 9: 2296693 => 175309 (7.6 %), compression: 90.3 MB/s, decompression: 2240.1 MB/s
zstd -c 10: 2296693 => 175521 (7.6 %), compression: 16.2 MB/s, decompression: 2312.4 MB/s
zstd -c 11: 2296693 => 175826 (7.7 %), compression: 15.5 MB/s, decompression: 2130.3 MB/s
zstd -c 12: 2296693 => 175836 (7.7 %), compression: 7.8 MB/s, decompression: 1928.6 MB/s
  5. Generate the dictionary:
% zstd --train-fastcover /Users/piotr/Downloads/1940/72511514711.csv -o dictionary-csv -B16384
Trying 82 different sets of parameters                                         
k=50                                                                           
d=6
f=20
steps=40
split=75
accel=1
Save dictionary of size 79101 into file dictionary-csv 
  6. Repeat the benchmarks with the dictionary:
% compresto benchmark-many /Users/piotr/Downloads/1940/72511514711.csv -a zstd -b 16384 --dict dictionary-csv      
zstd -c -7: 2296693 => 312131 (13.6 %), compression: 371.4 MB/s, decompression: 861.0 MB/s
zstd -c -6: 2296693 => 299806 (13.1 %), compression: 475.9 MB/s, decompression: 993.2 MB/s
zstd -c -5: 2296693 => 290887 (12.7 %), compression: 505.4 MB/s, decompression: 1012.1 MB/s
zstd -c -4: 2296693 => 280730 (12.2 %), compression: 558.1 MB/s, decompression: 958.6 MB/s
zstd -c -3: 2296693 => 265723 (11.6 %), compression: 542.2 MB/s, decompression: 1158.2 MB/s
zstd -c -2: 2296693 => 256276 (11.2 %), compression: 632.8 MB/s, decompression: 1201.9 MB/s
zstd -c -1: 2296693 => 236373 (10.3 %), compression: 647.9 MB/s, decompression: 1309.7 MB/s
zstd -c 1: 2296693 => 214283 (9.3 %), compression: 639.2 MB/s, decompression: 1297.3 MB/s
zstd -c 2: 2296693 => 215959 (9.4 %), compression: 409.2 MB/s, decompression: 1290.2 MB/s
zstd -c 3: 2296693 => 208883 (9.1 %), compression: 303.0 MB/s, decompression: 820.6 MB/s
zstd -c 4: 2296693 => 208760 (9.1 %), compression: 318.3 MB/s, decompression: 759.1 MB/s
zstd -c 5: 2296693 => 192558 (8.4 %), compression: 195.5 MB/s, decompression: 601.3 MB/s
zstd -c 6: 2296693 => 180981 (7.9 %), compression: 137.6 MB/s, decompression: 555.3 MB/s
zstd -c 7: 2296693 => 165328 (7.2 %), compression: 117.2 MB/s, decompression: 1054.9 MB/s
zstd -c 8: 2296693 => 165212 (7.2 %), compression: 111.7 MB/s, decompression: 961.2 MB/s
zstd -c 9: 2296693 => 165395 (7.2 %), compression: 87.8 MB/s, decompression: 491.0 MB/s
zstd -c 10: 2296693 => 165782 (7.2 %), compression: 69.9 MB/s, decompression: 675.6 MB/s
zstd -c 11: 2296693 => 166528 (7.3 %), compression: 45.8 MB/s, decompression: 711.0 MB/s
zstd -c 12: 2296693 => 166522 (7.3 %), compression: 43.6 MB/s, decompression: 714.8 MB/s

Expected behavior
The Zstd website states that using a dictionary improves the performance of both compression and decompression. I expect decompression speed to be at least the same as without the dictionary.

Desktop

  • OS: macOS
  • Version: Sonoma 14.6.1
  • Compiler: rust 1.82.0, toolchain: stable-aarch64-apple-darwin
  • Flags: --release
  • Other relevant hardware specs: M2 Pro, 32 GB RAM
  • Build system: cargo 1.82.0

Additional context

  • Zstd v1.5.5
  • rust zstd bindings v0.13.2

This is visible only when compressing small blocks independently, e.g. 1 kB - 16 kB (use the -b setting to change it). The phenomenon smoothly disappears as the block size gets larger.

Profile
Here is the profile I captured with Xcode Instruments:

[Xcode Instruments profile screenshot]

The scenario presented on the README page is reproducible and can be checked locally using the provided sample set.

For completeness, I re-ran this test on my local laptop (M1 Pro), and got the following result:

| level | dec.Speed | w/Dictionary |
| ----- | --------- | ------------ |
| 1     | 557 MB/s  | 2200 MB/s    |
| 2     | 382 MB/s  | 2220 MB/s    |
| 3     | 383 MB/s  | 2345 MB/s    |
| ...   | ...       | ...          |
| 19    | 389 MB/s  | 2465 MB/s    |

So yes, the speed improvement is really impressive in this case.
Now, it doesn't mean that every dictionary decompression scenario will be similar to this one.
In fact, we have since been surprised to observe a very large variety of outcomes when employing dictionaries.

For the github_users dataset, speed is improved in part because the compression ratio is also improved by a lot.
The general shape of the target data (a record) is fairly stable: many fields are the same, in the same order, and even their content is similar. Consequently, there are lots of fairly large segments which are constantly repeated across samples and present in the dictionary.
This translates into great compression ratio improvements, achieved by using fewer and longer sequences in the compressed file. As decompression time is primarily driven by the number of sequences, fewer sequences translate into more speed.

In the csv scenario reported in this issue, the setup is quite different:
the original source files are actually pretty large,
so they are cut into blocks of 16 KB to artificially create lots of "small" files.
Then a dictionary is created from these blocks, and used for compression and decompression.

The thing is, since these blocks have been cut blindly at arbitrary boundaries, their content is not cleanly structured: a block may start in the middle of a field, or in the middle of a series of fields.
Also, note that 16 KB is already "not that small" (compared to roughly ~1 KB for the github scenario), and that's already territory where dictionary benefits become smaller.
That being said, since the csv file is essentially a series of rows with fairly redundant information and fields, the dictionary is still useful.

So the reported csv scenario is structured in a way which makes the dictionary less efficient.
Hence, the corresponding impact on decompression speed is also going to be less favorable.

Now, one could say "ok, the dictionary is less efficient, but at least it's positive, so speed should be better?".
Well, now we get into some complex territory.

In normal decompression, there is only one buffer.
In terms of pointer management, it's fairly simple: get the offset, subtract from the current pointer, bounds-check, and go.
With dictionary decompression, there are now two buffers.
This leads to additional branches in the code.
Moreover, depending on how effective the dictionary is, this branch is going to be more or less predictable. The less predictable it is, the more it weighs down on performance.
Funny observation: if the dictionary is essentially useless, the branch becomes predictable, and all matches end up within the current buffer. In this case, the performance cost is still there, but it's much smaller. Unpredictability is really the larger cost here.

Long story short, one sequence during dictionary decompression costs more time than one sequence during normal single-buffer decompression.
So, for dictionary compression to compensate and actually produce decompression speed gains, it needs to compress the data much better than it would without a dictionary.
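
To make the cost concrete, here is a purely illustrative sketch of the per-sequence match copy, with hypothetical names; it is not zstd's actual decoder code, just the shape of the extra branch. With a dictionary, every sequence has to decide whether its match starts in the current output buffer or in the external dictionary buffer.

```c
#include <stddef.h>

/* Illustrative sketch only -- hypothetical names, not zstd's real decoder. */
void copy_match(unsigned char* op, unsigned char* dstStart,
                const unsigned char* dictEnd,
                size_t offset, size_t matchLength)
{
    size_t const curDist = (size_t)(op - dstStart);  /* bytes already produced */
    size_t n = 0;
    if (offset <= curDist) {
        /* Single-buffer path: the match lies in the current output buffer.
         * Byte-by-byte copy handles overlapping matches (offset < length). */
        const unsigned char* match = op - offset;
        for (; n < matchLength; n++) op[n] = match[n];
    } else {
        /* Dictionary path: the match starts in the external dictionary buffer,
         * so the copy may be split across the two buffers.
         * This is the extra, potentially unpredictable branch. */
        size_t const dictPart = offset - curDist;
        const unsigned char* match = dictEnd - dictPart;
        size_t const fromDict = dictPart < matchLength ? dictPart : matchLength;
        for (; n < fromDict; n++) op[n] = match[n];
        /* Any remainder continues from the start of the current buffer. */
        for (; n < matchLength; n++) op[n] = dstStart[n - fromDict];
    }
}
```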

There are other decompression speed advantages to dictionaries, such as pre-computed statistics, but these work better at small sizes (like ~1 KB) and low compression levels: larger input sizes increase the odds that the pre-computed statistics are too far away from optimal, and higher compression levels give more time for analysis to reach that conclusion. Also, the speed benefit of pre-computed statistics is inversely proportional to input size.

In short, speed benefits from dictionary compression should be expected for really small (~1 KB) and structured data, where they can be major. But for small-ish blocks (~16 KB) with less structure, the conclusion can be different.

Edit:
I went ahead and tested the 72511514711.csv file, going straight to 1 KB blocks, which is where dictionary compression should have the most impact.

And indeed, in this 1 KB block scenario, the speed benefit of dictionary decompression is apparent:

| level | dec.Speed | w/Dictionary |
| ----- | --------- | ------------ |
| 1     | 656 MB/s  | 1232 MB/s    |
| 2     | 492 MB/s  | 1200 MB/s    |
| 3     | 482 MB/s  | 1320 MB/s    |
| ...   | ...       | ...          |
| 19    | 508 MB/s  | 1776 MB/s    |

I also note that the dictionary improves compression substantially in this scenario (from ~4x to ~11x), so this is a "good" case, and I would expect such a result to also translate into decompression speed benefits.

By 4 KB, the speed benefit is still there, but much reduced:

| level | dec.Speed  | w/Dictionary |
| ----- | ---------- | ------------ |
| 1     | 1197 MB/s  | 1607 MB/s    |
| 2     | 1149 MB/s  | 1585 MB/s    |
| 3     | 1139 MB/s  | 1722 MB/s    |
| ...   | ...        | ...          |
| 19    | 1428 MB/s  | 2212 MB/s    |

By 16 KB, the speed difference is mostly gone,
though interestingly, I also do not observe the speed loss reported above:

| level | dec.Speed  | w/Dictionary |
| ----- | ---------- | ------------ |
| 1     | 1954 MB/s  | 2012 MB/s    |
| 2     | 1887 MB/s  | 2002 MB/s    |
| 3     | 1826 MB/s  | 1914 MB/s    |
| ...   | ...        | ...          |
| 19    | 2244 MB/s  | 2400 MB/s    |

The difference in observed relative performance between this measurement and the reported issue could be down to CPU architecture differences (notably cache sizes).
Also, it's not obvious what compresto is doing exactly, and since the devil is in the details, there might be some subtle inefficiencies weighing down the dictionary decompression scenario more (compared to the zstd internal benchmark).

Hey! What a great write-up! Thank you very much for taking the time to answer me and even running your own benchmarks!

So I conclude that I'm somehow using zstd incorrectly / suboptimally, or the bindings are doing something weird.

Can you point me to:

  • an example showing which C API calls I should use to compress/decompress a single chunk with a dictionary, so that zstd does as much work as possible at the initialization stage and the work needed per chunk of data is reduced; in the production system those chunks will be processed in random order
  • your benchmarking tool? Is it open source? If possible I'd like to run it myself and compare which APIs it calls vs. my program, to find the root cause of the discrepancy

Thank you!

> your benchmarking tool?

`zstd -b`

> an example showing which C API calls

examples/dictionary_decompression.c
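
For reference, here is a minimal sketch of the pattern that example demonstrates, separating one-time setup from per-chunk work. Error handling is elided, and the trained dictionary is replaced by a tiny placeholder buffer so the sketch stays self-contained; in real use, load the dictionary file from disk.

```c
#include <stdio.h>
#include <zstd.h>

int main(void)
{
    /* Placeholder standing in for the trained dictionary loaded from disk. */
    const char dictBuf[] = "placeholder dictionary content";
    size_t const dictSize = sizeof dictBuf;

    /* One-time setup: the dictionary is digested once, up front. */
    ZSTD_CDict* const cdict = ZSTD_createCDict(dictBuf, dictSize, 3 /* level */);
    ZSTD_DDict* const ddict = ZSTD_createDDict(dictBuf, dictSize);
    ZSTD_CCtx*  const cctx  = ZSTD_createCCtx();
    ZSTD_DCtx*  const dctx  = ZSTD_createDCtx();

    /* Per-chunk work: contexts and dictionary objects are reused for every
     * chunk, in any order. */
    const char src[] = "one small chunk of data";
    char compressed[256];
    char decompressed[256];

    size_t const cSize = ZSTD_compress_usingCDict(
        cctx, compressed, sizeof compressed, src, sizeof src, cdict);
    size_t const dSize = ZSTD_decompress_usingDDict(
        dctx, decompressed, sizeof decompressed, compressed, cSize, ddict);

    printf("chunk: %zu bytes -> %zu compressed -> %zu decompressed\n",
           sizeof src, cSize, dSize);

    ZSTD_freeCCtx(cctx);
    ZSTD_freeDCtx(dctx);
    ZSTD_freeCDict(cdict);
    ZSTD_freeDDict(ddict);
    return 0;
}
```

The key point is that ZSTD_createCDict / ZSTD_createDDict do the dictionary digestion once, so each chunk only pays for the (de)compression itself, regardless of processing order.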

Thank you! `zstd -b` gives me at least the same performance numbers as compresto without the dictionary. And I confirm the performance does not degrade when enabling the dictionary. So there is definitely something fishy that compresto does ;)