Add support for zstd-compressed corpora
danielmitterdorfer opened this issue · 0 comments
Rally supports various compression formats such as gz or bzip. It does not support the zstd format, which performed significantly better in both disk usage and decompression speed in my experiments. I've compressed a 183GB corpus with `pbzip2` and `pzstd`, both with the maximum compression level that is supported by the respective tool.
Format | Size on disk [bytes] | Size on disk [GB] | Relative size [%] |
---|---|---|---|
bzip | 18613471805 | 18 | 100 |
zstd | 11215205385 | 11 | 60 |
Decompression is also much faster (times measured with `time`; the table contains the output of `real`, i.e. wall clock time):
Format | Time to decompress [s] | Relative time [%] |
---|---|---|
bzip | 388 | 100 |
zstd | 144 | 36 |
Therefore I propose to add support for zstd compression to Rally, similar to the bzip support: the fast option would require `pzstd` to be on `PATH`, and a fallback can be based on the Python zstd implementation.
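A minimal sketch of how that selection could look, not Rally's actual code. The function names `pzstd_command` and `decompress_zstd` are hypothetical, and the fallback assumes the third-party `zstandard` package as the Python implementation. Whether that fallback handles every file produced by `pzstd` would still need to be verified.

```python
import shutil
import subprocess


def pzstd_command(src, dst):
    """Build the pzstd decompression call: pzstd -d <src> -o <dst>."""
    return ["pzstd", "-d", src, "-o", dst]


def decompress_zstd(src, dst):
    """Decompress src to dst, preferring pzstd and falling back to Python.

    Returns the name of the path taken ("pzstd" or "python").
    """
    if shutil.which("pzstd") is not None:
        # Fast path: the external tool was found on PATH.
        subprocess.run(pzstd_command(src, dst), check=True)
        return "pzstd"
    # Fallback (assumption): the third-party `zstandard` package.
    # Imported lazily so the fast path does not depend on it.
    import zstandard

    with open(src, "rb") as fin, open(dst, "wb") as fout:
        zstandard.ZstdDecompressor().copy_stream(fin, fout)
    return "python"
```

Calling `decompress_zstd("corpus.json.zstd", "corpus.json")` would use `pzstd` when it is available and the pure-Python path otherwise.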
For reference:
- Compress data: `pzstd -19 corpus.json -o corpus.json.zstd` (`19` denotes the maximum compression level)
- Decompress data: `pzstd -d corpus.json.zstd -o corpus.json`