how to compare fast5 files
RichardCorbett opened this issue · 5 comments
Hi folks.
I'm very happy to see that you worked on this project and published it. I hope the folks in Oxford pick up some of what you showed and take advantage of these findings.
I am running some tests using version 0.3.0.
I have the following datasets, all derived from a single .fast5 file containing 4000 genomic ONT PromethION reads.
dataset | file method | file size (KB) |
---|---|---|
1 | original zlib fast5 | 3027586 |
2 | input: 1, f2s zlib record and svb-zd signal compression | 1839624 |
3 | input: 1, f2s zstd record and svb-zd signal compression | 1771464 |
4 | input: 2, s2f | 2701116 |
5 | input: 3, s2f | 2701116 |
The .fast5 -> blow5 -> .fast5 round trips in 4 and 5 produced exactly the same file regardless of the compression parameters, but both files differ in size from the original.
I've been poking around with h5diff to verify that everything in my original fast5 is still recovered in the resulting .fast5 after round-tripping through blow5, but I can't get it to report differences or to confirm that the file contents are the same. Can you share the approach you used to compare .fast5 file contents? I also tried basecalling multiple times with guppy, but the non-deterministic manner in which the basecalling output is produced limits its utility here.
Also, is the size difference between 1, 4, and 5 possibly just due to a difference in compression? I don't see any information describing whether the .fast5 -> blow5 -> .fast5 output is zlib, vbz, or otherwise compressed.
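One way to check which compression filter the regenerated fast5 actually uses is the standard HDF5 CLI tooling; this is just a sketch and the file name is a placeholder:

```sh
# Print dataset creation properties without the data itself; the FILTERS section
# shows e.g. DEFLATE (zlib) or a USER_DEFINED_FILTER id for ONT's vbz plugin.
h5dump -p -H converted.fast5 | grep -A3 -i filter
```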
thanks
Richard
Hello Richard,
Thank you for trying slow5tools!
There is no direct fast5 comparison method. We run guppy on the original and the s2f-converted fast5 files and then compare the outputs (fastq and sequencing summary files). This is the ultimate test we do before a release. (https://github.com/hasindu2008/slow5tools/blob/master/test/test_with_guppy.sh)
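A minimal sketch of that comparison, assuming guppy_basecaller is on the PATH; the config name and directory paths are placeholders you would adapt to your flow cell and kit:

```sh
# Basecall both copies of the data (config/paths are placeholders).
guppy_basecaller -i original_fast5/ -s guppy_original/ -c dna_r9.4.1_450bps_hac.cfg
guppy_basecaller -i s2f_fast5/ -s guppy_s2f/ -c dna_r9.4.1_450bps_hac.cfg

# Guppy's output order is not deterministic, so sort the fastq records before diffing.
cat guppy_original/*.fastq | paste - - - - | sort -k1,1 | tr '\t' '\n' > original.sorted.fastq
cat guppy_s2f/*.fastq      | paste - - - - | sort -k1,1 | tr '\t' '\n' > s2f.sorted.fastq
diff original.sorted.fastq s2f.sorted.fastq

# Compare the sequencing summaries the same way (skipping the header line).
tail -n +2 guppy_original/sequencing_summary.txt | sort > original.summary.sorted
tail -n +2 guppy_s2f/sequencing_summary.txt      | sort > s2f.summary.sorted
diff original.summary.sorted s2f.summary.sorted
```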
While developing, we compare the outputs of f2s and f2s -> s2f -> f2s. This should be run on a small dataset as it creates ASCII SLOW5 files. (https://github.com/hasindu2008/slow5tools/blob/master/test/test_f2s_s2f_integrity.sh)
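A rough sketch of that developer check on a single small fast5; the option spellings (--to, -o) follow slow5tools' documented usage as I recall it, so check `slow5tools f2s --help` on your install:

```sh
slow5tools f2s reads.fast5 -o a.slow5 --to slow5      # fast5 -> ASCII slow5
slow5tools s2f a.slow5 -o roundtrip.fast5             # slow5 -> fast5
slow5tools f2s roundtrip.fast5 -o b.slow5 --to slow5  # round-tripped fast5 -> ASCII slow5 again
diff a.slow5 b.slow5                                  # should report no differences
```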
The reason for the size reduction could be the 'unaccounted space' in the HDF5 file format. When an HDF5 file is updated after creation, garbage space accumulates (https://docs.hdfgroup.org/hdf5/rfc/FileSpaceManagement.pdf). I assume the original fast5 files have more unaccounted space than the s2f output. You can check it with the following command.
h5stat -S [fast5 file]
For example, a dataset of ~1.5 TB had ~50 GB of unaccounted space. To get rid of this unaccounted space, one has to use h5repack. As you have observed, the same can now be done using s2f!
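For reference, a quick way to inspect and reclaim that space with the standard HDF5 tools (file names are placeholders):

```sh
h5stat -S reads.fast5                 # print the file space summary, including unaccounted space
h5repack reads.fast5 repacked.fast5   # rewrite the file without the accumulated garbage space
h5stat -S repacked.fast5              # unaccounted space should now be (close to) zero
```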
Please let us know how it goes.
Regards,
Hiruna
Wonderful, thanks @hiruna72.
I hadn't appreciated that the differences I see between repeated guppy runs are due to the sort order of the reads. Once I re-sort, I can confirm that the resulting .fastq files from my original and blow5-cycled fast5 files are identical.
I ran h5stat -S and got the following results:
Summary of file space information | original fast5 | blow5 default -> fast5 | blow5 vbz -> fast5 |
---|---|---|---|
File metadata (bytes) | 69768286 | 95031240 | 95031240 |
Raw data (bytes) | 2660055579 | 2660055579 | 2660055579 |
Amount/Percent of tracked free space | 0 | 0 | 0 |
Unaccounted space (bytes) | 19923406 | 176 | 176 |
Total space (bytes) | 2749747271 | 2755086995 | 2755086995 |
Disk space | 2.9 GB | 2.6 GB | 2.6 GB |
So if I'm interpreting this correctly, there is ~20 MB of unaccounted space in the original .fast5 file. However, it looks like I'm saving ~300 MB of disk space by cycling through blow5. Does this make sense to you?
Hi Richard,
Thank you for summarizing the results. It is really helpful.
I assume the Total space in the table is the sum of the Total space values of each fast5 file (from h5stat).
How did you get the Disk space?
Hi @hiruna72, to get the disk usage of each file I ran du -sh.
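To put the du figures and the h5stat byte counts in the same units, one option (assuming GNU du/stat; file names are placeholders):

```sh
du -sh original.fast5 converted.fast5                  # allocated disk usage, human readable
du -sh --apparent-size original.fast5 converted.fast5  # logical file size rather than allocated blocks
stat -c %s original.fast5                              # exact size in bytes, directly comparable to h5stat's Total space
```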
I am glad that you are trying it out and finding it useful. Seeing more users adopt it encourages us to keep improving the tools and put even more effort into making them better. Thank you.
To add to what @hiruna72 said, HDF5/FAST5 is a very complex format with a number of different storage allocation schemes, and we do not know exactly which parameters ONT's MinKNOW uses, as it is closed source. When generating fast5s with slow5tools we use the default HDF5 allocation scheme, so these file size differences are likely due to differences in those storage allocation parameters. If you have a dataset of relatively short reads (e.g., cDNA or viral amplicons), convert it to BLOW5 and you will see that the space saving becomes even larger and the converted-back fast5 files are significantly smaller than the original FAST5 files.
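For anyone reproducing the round trip, a sketch of one conversion; the option names (-c for record compression, -s for signal compression, -d for the output directory) follow slow5tools' documented usage as I recall it, so confirm with `slow5tools f2s --help` on your version:

```sh
# fast5 -> BLOW5 with zstd record compression and svb-zd signal compression
# (option names assumed; see `slow5tools f2s --help`).
slow5tools f2s original_fast5/ -d blow5_out/ -c zstd -s svb-zd

# BLOW5 -> fast5; the regenerated fast5 uses slow5tools' default HDF5
# allocation and compression settings, not necessarily MinKNOW's.
slow5tools s2f blow5_out/ -d fast5_roundtrip/
```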
Despite the differences in size, the ultimate test is to basecall the original files and the reconverted files with Guppy and check that the diff passes on the sorted fastqs and sorted sequencing summaries. If the diff passes, it means all the raw signal data has been preserved without loss and we do not need to worry.
If you have any more questions or comments please let us know.