VGP/vgp-tools

Alternative libraries for VGPzip

jkbonfield opened this issue · 3 comments

If you are wedded to using Deflate, don't use Zlib as it's simply ancient technology. I'd advise libdeflate instead as generally it's over double the performance and produces compatible data streams. It also offers (at a CPU cost) higher compression levels than zlib if desired.

However, better still IMO, given this is a new proposal, is to use Zstd instead. It's a better format than Deflate, offering faster compression and faster decompression while generally producing smaller output. Basically it's a win-win-win.

(Better in terms of ratio is libbsc, but it has a higher CPU cost, so that's definitely a tradeoff and may not be appropriate.)

For comparisons, see https://quixdb.github.io/squash-benchmark/unstable/ which shows the Pareto frontier. Obviously esoteric tools aren't appropriate, but it permits us to see how the standard well supported tools stack up against each other. Zstd covers quite a lot of the speed vs size tradeoffs optimally.

libdeflate supports both the Zlib and Gzip encapsulations of the Deflate specification. In fact, libdeflate even comes with a gzip executable.
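To make "encapsulation" concrete, here is a minimal sketch using Python's standard zlib module as a stand-in (it is not libdeflate's API). The same Deflate compressor is wrapped three ways, selected by the `wbits` argument. The `deflate` helper and the sample payload are made up for illustration.

```python
import zlib

def deflate(data: bytes, wbits: int) -> bytes:
    """Compress data with Deflate; wbits selects the container format."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()

# Hypothetical VCF-like payload, only here to have something repetitive.
payload = b"#CHROM\tPOS\tID\tREF\tALT\n" * 1000

raw = deflate(payload, -15)           # bare Deflate stream, no header or trailer
zlib_wrapped = deflate(payload, 15)   # zlib container: header plus Adler-32 trailer
gzip_wrapped = deflate(payload, 31)   # gzip container: header plus CRC-32 and length trailer

# Show the container sizes and the leading magic bytes of each wrapper.
print(len(raw), len(zlib_wrapped), len(gzip_wrapped))
print(zlib_wrapped[:1].hex(), gzip_wrapped[:2].hex())

# wbits=47 (32 + 15) asks zlib to auto-detect a zlib or gzip header.
assert zlib.decompress(zlib_wrapped, 47) == payload
assert zlib.decompress(gzip_wrapped, 47) == payload
# A raw stream has no header to detect, so it needs a negative wbits.
assert zlib.decompress(raw, -15) == payload
```

Because only the wrapper differs, a decoder built on one library can read output written by the other, provided both agree on the container.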

The main difference with libdeflate is that its design is block based rather than streaming with source/sink buffers. This means it can't do LZ matching between blocks of course, but this happens to fall neatly into our use case anyway.
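A rough sketch of what block-based compression looks like in practice, again using Python's zlib-backed gzip module as a stand-in for libdeflate. The block size, the `compress_blocks` helper and the synthetic data are all assumptions for illustration, not anything VGPzip specifies.

```python
import gzip

BLOCK_SIZE = 64 * 1024  # hypothetical block size, chosen only for illustration

def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Compress each fixed-size block as a self-contained gzip member."""
    return [
        gzip.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]

# Synthetic VCF-like records, roughly 750 KB in total.
data = b"".join(b"chr1\t%d\t.\tA\tG\n" % i for i in range(50000))
blocks = compress_blocks(data)

# Concatenated gzip members still decode as one ordinary gzip stream.
assert gzip.decompress(b"".join(blocks)) == data

# Any single block decodes on its own, which is what allows random access
# and lets separate threads work on separate blocks.
assert gzip.decompress(blocks[2]) == data[2 * BLOCK_SIZE:3 * BLOCK_SIZE]

# The cost: no LZ matches across block boundaries, so the blocked output
# is normally somewhat larger than a single stream over the same data.
print(len(b"".join(blocks)), len(gzip.compress(data)))
```

Each block is compressed with no knowledge of its neighbours, so the work parallelises trivially and a reader can seek to a block boundary and start decoding there.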

An example of compressing a VCF using zlib vs libdeflate, and then decompressing each other's output:

$ time ./bgzip.libdeflate -@8 < /tmp/a.vcf > /tmp/a.vcf.libdeflate.gz
real	0m14.559s
user	1m40.961s
sys	0m10.068s

$ time ./bgzip.zlib -@8 < /tmp/a.vcf > /tmp/a.vcf.zlib.gz
real	0m33.221s
user	3m25.225s
sys	1m5.752s

-rw-r--r-- 1 jkb team117 18435621310 Nov 28 11:46 /tmp/a.vcf
-rw-r--r-- 1 jkb team117   349434708 Nov 28 11:49 /tmp/a.vcf.libdeflate.gz
-rw-r--r-- 1 jkb team117   341851007 Nov 28 11:50 /tmp/a.vcf.zlib.gz


$ time ./bgzip.zlib -t -@2 /tmp/a.vcf.libdeflate.gz
real	0m15.568s
user	0m32.444s
sys	0m2.804s

$ time ./bgzip.libdeflate -t -@2 /tmp/a.vcf.zlib.gz
real	0m11.812s
user	0m25.531s
sys	0m2.615s

# And for good measure, the system gzip utility vs libdeflate
$ time gzip -d < /tmp/a.vcf.libdeflate.gz > /dev/null
real	1m5.170s
user	1m4.730s
sys	0m0.400s

$ time ~/ftp/compression/libdeflate/gzip -d < /tmp/a.vcf.libdeflate.gz > /dev/null
real	0m20.680s
user	0m20.550s
sys	0m0.120s

Test machine was Ubuntu Bionic with 16x 2.6GHz Intel Broadwell CPUs.

I've no idea why the system gzip is so much slower than bgzip linked against the system zlib. Baffling.

Oh, and the benefits of not wedding ourselves to an ancient legacy format. The same file with the default zstd compression level:

$ time zstd < /tmp/a.vcf > /tmp/a.vcf.zstd
real	0m22.768s
user	0m15.274s
sys	0m21.140s

$ time zstd -d < /tmp/a.vcf.zstd > /dev/null
real	0m8.803s
user	0m8.506s
sys	0m0.296s

-rw-r--r-- 1 jkb team117   341851007 Nov 28 11:50 /tmp/a.vcf.zlib.gz
-rw-r--r-- 1 jkb team117   227056198 Nov 28 12:00 /tmp/a.vcf.zstd

Or with comparable speed to libdeflate, turning up the compression level (it goes up to 22, but default is 3 I think):

$ time zstd -T8 -9 < /tmp/a.vcf > /tmp/a.vcf.zstd
real	0m13.803s
user	1m43.484s
sys	0m8.338s

-rw-r--r-- 1 jkb team117 188613044 Nov 28 12:05 /tmp/a.vcf.zstd

$ time zstd -d < /tmp/a.vcf.zstd > /dev/null
real	0m7.255s
user	0m6.915s
sys	0m0.329s
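For anyone wanting the sizes above as ratios, a small sketch that does the arithmetic. The byte counts are copied from the ls listings in this thread; the labels and helper names are mine.

```python
# Sizes in bytes, copied from the ls -l listings above.
ORIGINAL = 18435621310
SIZES = {
    "libdeflate (bgzip)": 349434708,
    "zlib (bgzip)": 341851007,
    "zstd default": 227056198,
    "zstd -9": 188613044,
}

def ratio(name: str) -> float:
    """Compression ratio: original size divided by compressed size."""
    return ORIGINAL / SIZES[name]

def relative_to_zlib(name: str) -> float:
    """Compressed size as a fraction of the zlib-compressed size."""
    return SIZES[name] / SIZES["zlib (bgzip)"]

for name in SIZES:
    print(f"{name:20s} ratio {ratio(name):6.2f}x   "
          f"size vs zlib {relative_to_zlib(name):.3f}")
```

On these figures libdeflate's output is about 2% larger than zlib's, while the zstd outputs come to roughly two thirds (default level) and a little over half (-9) of the zlib size.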