Malformed blow5 record - can they still be merged?
hengjwj opened this issue · 8 comments
Hi again @hasindu2008,
I encountered the following error when attempting to merge BLOW5s:
[slow5_get_next_mem::ERROR] Malformed blow5 record. Failed to read the record size. Missing blow5 end of file marker. At src/slow5.c:3236
This data came from ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR470/ERR4708848/HEK293T-Mettl3-KO-rep2.tar.gz (I just saw that this was also mentioned in #89). I split the FAST5s into batches of 50 files, ran f2s on each batch, and saw that only 7 FAST5s were lost, so I went ahead and deleted the FAST5s.
While f2s was running, I did observe several of these errors but thought the BLOW5 file would still be intact:
[f2s_child_worker::ERROR] Bad fast5: Could not read contents of the fast5 file 'FAK28957_4e4cc36706f219188246d6743803dc2e9ed55520_403.fast5'.
The command I used for f2s was:
slow5tools f2s -p $numcore $i -d ${i}_blow5
Can these BLOW5s still be merged?
Joel
Hi
If there was an error, the slow5tools process terminates at that point, so I strongly recommend not trying to merge such files (there are hacky ways to do it, but I highly discourage them).
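One way to catch this before deleting any FAST5 files is to save the f2s messages to a log and refuse to continue if any error line is present. The following is only a sketch: `log_is_clean` is a hypothetical helper (not part of slow5tools), and it assumes error lines contain `::ERROR]` as in the messages quoted above and that the messages are written to stderr.

```shell
# Succeed (exit status 0) only if the log contains no error lines.
# Assumption: slow5tools error lines contain the literal text "::ERROR]".
log_is_clean() {
    ! grep -qF '::ERROR]' "$1"
}

# Possible usage with the command from this thread (not run here):
#   if slow5tools f2s -p "$numcore" "$i" -d "${i}_blow5" 2> "${i}_f2s.log" \
#        && log_is_clean "${i}_f2s.log"; then
#       echo "conversion of $i looks complete"
#   else
#       echo "keep the FAST5 files in $i and inspect ${i}_f2s.log"
#   fi
```

Checking both the exit status and the log is deliberate, since either one alone could miss a failure in a child process.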
I downloaded the tar.gz file you mentioned to give it a go, but when I try to extract:
gzip: stdin: invalid compressed data--crc error
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Did you get a similar error? The downloaded tar.gz is around 124GB in my case.
Perhaps compare the md5sum hash to check that the download isn't doing something strange.
@Psy-Fer Do you know how to get the MD5 from ENA? The link is https://www.ebi.ac.uk/ena/browser/view/ERR4708848.
Yeah, I got a similar error, but the authors said that it should be fine as the FAST5 files are still extracted.
Hey @hasindu2008
You can find it in the xml file at the bottom.
Should be
4f3f118f5ba809da987bbaf69edb8860
Thanks @Psy-Fer seems they match.
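For anyone repeating the check, a minimal sketch of the comparison follows. `check_md5` is a hypothetical helper name, and it assumes GNU `md5sum` is available; the expected value is the one quoted above from the ENA XML.

```shell
# Compare a file's MD5 digest with an expected value.
# Prints "OK" on a match; prints the actual digest and returns 1 otherwise.
check_md5() {
    file=$1
    expected=$2
    # md5sum prints "<digest>  <filename>"; keep only the digest
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK"
    else
        echo "MISMATCH: got $actual"
        return 1
    fi
}

# Possible usage (not run here):
#   check_md5 HEK293T-Mettl3-KO-rep2.tar.gz 4f3f118f5ba809da987bbaf69edb8860
```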
@hengjwj
I managed to convert that dataset. How I did was:
1. Extracted the tarball into a directory named fast5.
2. Ran slow5tools f2s and noted down the names of the FAST5 files that caused problems (error messages in f2s).
3. Moved those bad FAST5 files from the fast5 directory to a separate directory called quarantine.
4. Deleted the blow5 files generated in step 2 and relaunched slow5tools f2s on the cleaned-up fast5 directory.
5. Merged the blow5 files from step 4 using slow5tools merge.
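The "note down and move" part of the steps above can be scripted. This is a rough sketch, not what was actually run: the two function names are hypothetical, it assumes the f2s messages were saved to a log file, that bad files are reported in the `Bad fast5: ... fast5 file '<name>'` format quoted earlier in this thread, and that the FAST5 files sit directly inside the source directory.

```shell
# Print the unique FAST5 names reported as bad in a saved f2s log.
list_bad_fast5() {
    sed -n "/Bad fast5/s/.*fast5 file '\([^']*\)'.*/\1/p" "$1" | sort -u
}

# Move every file reported as bad from the source directory to a quarantine directory.
quarantine_bad_fast5() {
    log=$1
    src=$2
    dest=$3
    mkdir -p "$dest"
    list_bad_fast5 "$log" | while IFS= read -r name; do
        f="$src/$(basename "$name")"
        # Skip names that are not present (e.g. already moved)
        if [ -f "$f" ]; then
            mv "$f" "$dest/"
        fi
    done
}

# Possible usage (not run here; -p 8 is an arbitrary choice):
#   slow5tools f2s fast5 -d blow5 -p 8 2> f2s.log
#   quarantine_bad_fast5 f2s.log fast5 quarantine
#   rm -r blow5
#   slow5tools f2s fast5 -d blow5 -p 8
#   slow5tools merge blow5 -o merged.blow5
```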
There were only a handful of bad FAST5 files, and if we do not care about those few thousand reads, all is good. However, I wanted to rescue as much as possible from those corrupted FAST5 files, so I went through the following additional steps:
7. Converted the fast5 files in quarantine into single-read fast5 files using ONT's multi_to_single_fast5, into a directory called q_single.
8. Ran f2s on q_single and noted down which single-read fast5 files are bad.
9. Deleted those bad single-read fast5 files.
10. Deleted the blow5 files from step 8 and relaunched slow5tools f2s on the cleaned-up q_single directory.
11. Merged the blow5 files from step 10.
12. Merged the two merged blow5 files from step 5 and step 11.
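As a rough guide, the rescue steps above correspond to a command sequence like the one printed by this sketch. It only prints the commands for review and executes nothing. `print_rescue_plan` and the names `q_blow5`, `q_f2s.log`, `q_merged.blow5` and `final.blow5` are placeholders chosen for illustration, and it assumes f2s picks up the single-read files that multi_to_single_fast5 writes under q_single.

```shell
# Print the rescue commands for review; nothing is executed.
print_rescue_plan() {
    quarantine=$1   # directory of multi-read FAST5 files that failed f2s
    main_blow5=$2   # merged BLOW5 built from the clean FAST5 files
    cat <<EOF
multi_to_single_fast5 -i $quarantine -s q_single
slow5tools f2s q_single -d q_blow5 2> q_f2s.log
# inspect q_f2s.log and delete the single-read FAST5 files it reports as bad
rm -r q_blow5
slow5tools f2s q_single -d q_blow5
slow5tools merge q_blow5 -o q_merged.blow5
slow5tools merge $main_blow5 q_merged.blow5 -o final.blow5
EOF
}

print_rescue_plan quarantine merged.blow5
```

Printing the plan first makes it easy to adjust directory names before running anything destructive such as the `rm -r`.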
Anyway, I have temporarily uploaded the final BLOW5 file to https://slow5test.s3.amazonaws.com/HEK293T-Mettl3-KO-rep2.blow5, so that you can download it and save time.
Note that there are multiple ways to handle bad FAST5 files. The above method is what I felt like doing today. Some other ways are discussed at #89, some of which are easier than above.
@hengjwj If you are planning to convert more datasets, first check whether they belong to the https://github.com/GoekeLab/sg-nex-data project, for which converted BLOW5 files are already available at http://sg-nex-data-blow5.s3-website-ap-southeast-1.amazonaws.com/.
Will download asap. Thanks for generating the file and the guide!