hasindu2008/slow5tools

BLOW5 file size is larger than FAST5 VBZ compressed

Closed this issue · 6 comments

Thanks for the cool tools! I think SLOW5 is a great idea and hope that ONT follows suit. I started to play around a little with slow5tools with the thought of converting my large number of projects from folders of FAST5 files to single BLOW5 files -- I initially thought I could save some space (a 25% reduction in file size is mentioned).

This seems to be true for older FAST5 datasets, for example:

$ du -sh old_project_fast5
100G    old_project_fast5
$ du -sh old_project.blow5
93G     old_project.blow5

So here I saved 7 GB by converting to BLOW5, which isn't 25% but still decent.

I then noticed that the newer VBZ compressed files are actually quite a bit smaller than BLOW5, presumably due to the compression, for example:

$ du -sh newer_project_fast5_vbz
28G     newer_project_fast5_vbz
$ slow5tools f2s newer_project_fast5_vbz -d newer_project_blow5 -p 12
$ du -sh newer_project_blow5
36G     newer_project_blow5

In this case, converting the VBZ FAST5 files to BLOW5 format actually increased the size substantially. So it seems that the VBZ compression method that ONT has rolled out recently does a lot to save space. Is it possible to use VBZ compression for BLOW5 format instead of zlib?
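For reference, the relative change in the two cases above can be worked out with a small helper. This is only a sketch: `pct_saving` is a made-up name, and the inputs are the rounded figures from `du -sh`, so the percentages are approximate.

```shell
# pct_saving ORIGINAL NEW
# Prints the percentage of space saved relative to ORIGINAL.
# A negative value means the converted data is larger.
pct_saving() {
    awk -v orig="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (orig - new) / orig * 100 }'
}

pct_saving 100 93   # older FAST5 -> BLOW5
pct_saving 28 36    # VBZ FAST5 -> BLOW5 (default zlib)
```

With these numbers the older project comes out at roughly a 7% saving, while the VBZ project grows by a bit under 30%.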

Hello,

yes, we are aware of this. We have been working on adding more compression methods, including something similar to VBZ. Look out for the new release coming soon.

Cheers,
James

hi @nextgenusfs

We have implemented some compression schemes similar to VBZ in the dev branch. VBZ is a combination of the streamvbyte and Facebook's zstd compression algorithms. The SLOW5 specification supports multiple compression methods, so we added streamvbyte and zstd support to slow5tools.
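As a rough illustration of why this kind of scheme suits nanopore signal: consecutive samples tend to be close together, so storing zig-zag encoded differences gives mostly small numbers, which a variable-byte encoder such as streamvbyte can store in a single byte each. The sketch below assumes that the "zd" in svb-zd stands for zig-zag delta. `zigzag_delta` is a hypothetical helper and the sample values are made up; this is not how slow5tools implements it.

```shell
# zigzag_delta: read integer samples (one per line) from stdin and print
# the zig-zag encoded difference to the previous sample.
# Zig-zag maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ... so that small
# negative and positive deltas both become small unsigned numbers.
zigzag_delta() {
    awk 'NR == 1 { prev = 0 }
         {
             d = $1 - prev
             prev = $1
             print (d >= 0 ? 2 * d : -2 * d - 1)
         }'
}

# Four made-up raw signal samples
printf '%s\n' 500 503 499 501 | zigzag_delta
```

Only the first value stays large, because it is a delta against zero. The rest fit comfortably in one byte, and a general-purpose compressor such as zstd can then be applied on top.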

Could you test these new compression methods on your dataset? Please check out the "dev" branch and do a git pull --recurse-submodules, then do a make zstd=1. You will need the zstd development libraries installed (on ubuntu: apt-get install libzstd1-dev ).

Then pass -c zstd -s svb-zd to f2s and merge, where -c specifies zstd as the record compression and -s specifies streamvbyte as the signal compression. This should give you sizes smaller than FAST5 compressed with VBZ. You can also try -c zlib -s svb-zd, which is a bit larger than zstd but still smaller than -c zlib -s none (what you already got with the default options in the master branch).
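To compare the two combinations side by side, something along these lines might help. It is only a sketch: `run` and `compare_methods` are hypothetical helpers, the output directory names are made up, and -p 12 is simply carried over from the earlier command. With DRY_RUN=1 the commands are printed rather than executed.

```shell
# run CMD...: execute the command, or just print it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# compare_methods FAST5_DIR: convert FAST5_DIR once per record compression
# method (signal compression fixed to svb-zd) and report each output size.
compare_methods() {
    in_dir=$1
    for rec in zstd zlib; do
        out_dir="${in_dir}_blow5_${rec}"
        run slow5tools f2s "$in_dir" -d "$out_dir" -p 12 -c "$rec" -s svb-zd &&
            run du -sh "$out_dir"
    done
}

# Print the commands without running them
DRY_RUN=1 compare_methods newer_project_fast5_vbz
```

Dropping DRY_RUN=1 runs the conversions for real, which needs a slow5tools binary built with zstd=1 on the PATH.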

About the 7% saving you observed as opposed to 25%:
This 25% was for an NA12878 PromethION dataset we used for benchmarking. I recently benchmarked a number of samples and there seems to be significant variability in space saving. Following are some examples:

image

For some datasets I observed only about a 12% saving, whereas for others I observed a ridiculous 60%+ saving. FAST5 (HDF5) is an utterly complex file format, and I believe its internal space allocation strategies are behind this variability.

Nice @hasindu2008! I'm on centos -- so can confirm the following has worked:

$ sudo yum install libzstd libzstd-devel
$ git clone https://github.com/hasindu2008/slow5tools.git
$ cd slow5tools
$ git checkout dev
$ git submodule update --init --recursive
$ make zstd=1

And then I used it like this on the same VBZ-compressed FAST5 folder as above:

/opt/apps/slow5tools/slow5tools f2s newer_project_vbz -d newer_project_blow5 -p 12 -c zstd -s svb-zd

And here are directory sizes:

$ du -sh newer_project_vbz
28G     newer_project_vbz
$ du -sh newer_project_blow5/
25G     newer_project_blow5/

So that looks good: it is now smaller than FAST5 VBZ! Very nice.

I did get an error on merging from the new directory, pasting here in case it helps you.

$ /opt/apps/slow5tools/slow5tools merge -h
Merge multiple SLOW5/BLOW5 files to a single file
Usage: /opt/apps/slow5tools/slow5tools merge [OPTION]... [SLOW5_FILE/DIR]...

OPTIONS:
    --to=[FORMAT]                      specify output file format
    -c, --compress=[REC_METHOD]        specify record compression method -- zlib (only available for format blow5)
    -s, --sig-compress=[SIG_METHOD]    specify signal compression method -- none (only available for format blow5)
    -o, --output [FILE]                output contents to FILE [default: stdout]
    -l, --lossless [STR]               retain information in auxiliary fields during the conversion.[default: true].
    -t, --threads [INT]                number of threads [default: 4]
    -K --batchsize                     the number of records on the memory at once. [default: 4096]
    -h, --help                         display this message and exit
FORMATS:
    slow5
    blow5
REC_METHODS:
    none
    zlib
    zstd
SIG_METHODS:
    none
    svb-zd


$ /opt/apps/slow5tools/slow5tools merge -c zstd -s svb-zd -o newer_project.blow5 newer_project_blow5/
[merge_main] 122 files found - took 0.000s

[slow5_get_aux_enum_labels::ERROR] No enum auxiliary type exists. At src/slow5.c:1157
[slow5_get_aux_enum_labels::ERROR] Exiting on error. At src/slow5.c:1157

@nextgenusfs

We released slow5tools v0.3.0 a little while ago with complete support for 'vbz' (zstd+svb), including bug fixes such as the one you encountered. Please give it a go - you can try the binaries.

By the way, out of curiosity, what are the two datasets you were trying this on? Stats like DNA or RNA, library prep method, read length, etc.?

These were likely both DNA ligation kits (LSK109) run on R9.4.1 flow cells with read N50 lengths of ~20-40 kb -- these are sort of our average library metrics. I'll check out the new binaries and let you know what level of compression I'm getting.

Hi @nextgenusfs I will close this issue for now. If you have any more troubles please feel free to reopen, Cheers!