Provide compression level for training dictionary
pkese opened this issue · 3 comments
Apparently, to get optimal performance when using a dictionary, the dictionary should be trained with the same compression level that will later be used for compressing with it.
Zstd's minimal search pattern size depends on the compression level: at low compression levels the minimal pattern size is 4 bytes or more, while at higher levels Zstd trades CPU time to also search for patterns smaller than 4 bytes.
If the dictionary is trained at a low compression level, it will contain only large patterns.
If that dictionary is then used to compress actual data at a high compression level,
there will be no patterns shorter than 4 bytes in the dictionary, so Zstd will do a lot of searching in vain.
Consequently Zstd wastes energy and the dictionary is not used as efficiently as it could be.
Please provide an option to parametrize the compression level when training the dictionary.
ZstdSharp/src/ZstdSharp/Unsafe/Zdict.cs, line 482 in 5bd8080
I'm measuring about a 2.9% improvement in compression ratio with the patch in #23 applied.
For the test I'm compressing a few dozen short text lines totaling 13206 bytes of raw text.
With a dictionary trained at the default compression level, that gets compressed down to 1560 bytes;
with a dictionary trained at the same level as the data compression, it comes down to 1515 bytes.
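For anyone wanting to reproduce this kind of measurement, here is a minimal sketch using ZstdSharp's safe API as I understand it (`DictBuilder.TrainFromBuffer`, `Compressor.LoadDictionary`, `Compressor.Wrap`); the corpus file name and the compression level are placeholders, not values from the test above:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using ZstdSharp;

// Placeholder corpus: one short text per line, as in the test above.
var samples = File.ReadLines("texts.txt")
    .Select(line => Encoding.UTF8.GetBytes(line))
    .ToArray();

// Train with the existing safe API (no compression-level parameter here),
// then compress every sample at the level actually used for the data.
var dict = DictBuilder.TrainFromBuffer(samples);

const int level = 19; // placeholder level
using var compressor = new Compressor(level);
compressor.LoadDictionary(dict);

var rawTotal = samples.Sum(s => s.Length);
var compressedTotal = samples.Sum(s => compressor.Wrap(s).Length);
Console.WriteLine($"raw: {rawTotal} bytes, compressed: {compressedTotal} bytes");
```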
You can call ZDICT_optimizeTrainFromBuffer_fastCover with any required parameters from your code and make your own safe wrapper for it. This is part of the unsafe public API of the ZstdSharp library, which is much the same as the original zstd library.
Safe wrappers such as DictBuilder have less functionality and are subject to extension.
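A rough sketch of such a wrapper might look like the following. Function and struct names mirror zstd's C API as exposed under `ZstdSharp.Unsafe`; the error check assumes the C-style unsigned return of `ZDICT_isError`, and defaults are simplified, so treat this as a starting point rather than a reference implementation:

```csharp
using System;
using System.Linq;
using ZstdSharp.Unsafe;

public static class FastCoverTraining
{
    // Safe wrapper over the unsafe fastCover trainer that exposes the
    // compression level the dictionary should be tuned for.
    public static unsafe byte[] TrainDictionary(byte[][] samples, int dictCapacity, int compressionLevel)
    {
        // ZDICT_* trainers take one flat buffer plus the size of each sample.
        var samplesBuffer = samples.SelectMany(s => s).ToArray();
        var sampleSizes = samples.Select(s => (nuint)s.Length).ToArray();
        var dict = new byte[dictCapacity];

        // Zero-initialized params mean "let the trainer pick"; only the
        // target compression level is set explicitly, which is the point
        // of this issue.
        var parameters = new ZDICT_fastCover_params_t();
        parameters.zParams.compressionLevel = compressionLevel;

        nuint written;
        fixed (byte* dictPtr = dict)
        fixed (byte* samplesPtr = samplesBuffer)
        fixed (nuint* sizesPtr = sampleSizes)
        {
            written = Methods.ZDICT_optimizeTrainFromBuffer_fastCover(
                dictPtr, (nuint)dict.Length,
                samplesPtr, sizesPtr, (uint)samples.Length,
                &parameters);
        }

        // Assumes ZDICT_isError keeps zstd's unsigned return convention.
        if (Methods.ZDICT_isError(written) != 0)
            throw new InvalidOperationException("dictionary training failed");

        // The trainer reports how many bytes of the buffer it actually used.
        Array.Resize(ref dict, (int)written);
        return dict;
    }
}
```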
Added TrainFromBufferFastCover method
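Usage should then roughly follow the existing DictBuilder shape. The parameter name below is a guess rather than the confirmed signature, and `samples` is the corpus from the earlier sketch, so check the source before relying on it:

```csharp
using ZstdSharp;

// Assumed call shape: train with fastCover at the level the dictionary
// will actually be used with (parameter name is hypothetical).
var dict = DictBuilder.TrainFromBufferFastCover(samples, compressionLevel: 19);

using var compressor = new Compressor(19);
compressor.LoadDictionary(dict);
```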