Errors when converting fast5 to slow5
loganylchen opened this issue · 16 comments
Hi @hasindu2008 ,
Me again. Recently, I have been investigating some public Nanopore DRS data. Following your suggestions, I first converted the raw FAST5 files into SLOW5, but I encountered an error when doing so.
# cmd
slow5tools f2s data/hek293tMettl3KORep3/fast5/workspace -d results/slow5/hek293tMettl3KORep3 -p 8 2>logs/fast5toslow5/hek293tMettl3KORep3.log
I have to say that when I used Guppy to re-basecall the FAST5 files, Guppy also raised some errors saying it can't read the dataset for read_id:xxxx. I think this may be caused by the raw FAST5 files, so I also tried the conversion on the Guppy-generated FAST5 files (produced with --fast5_out during basecalling).
I am asking whether I can ignore these problematic reads or FAST5 files when doing the conversion, or whether there is a better way to fix this kind of issue?
Best
Hi, can I know the original source of this data?
If it is from https://github.com/GoekeLab/sg-nex-data, may I suggest downloading directly from the BLOW5 bucket here: http://sg-nex-data-blow5.s3-website-ap-southeast-1.amazonaws.com?
There were quite a few FAST5 corruptions and many inconsistencies in the original FAST5 files of some datasets. I had to do the conversion to BLOW5 in quite a convoluted way and also needed some manual curation, so you could save a lot of time by downloading the BLOW5 files instead.
Thanks to cying111, jonathangoeke and AWS, those datasets in BLOW5 format are hosted on the AWS public bucket.
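For example, assuming the bucket name matches the website URL above and you have the AWS CLI installed, something like this should let you list and download the files without an AWS account (the dataset path is just a placeholder):
# list the top level of the public bucket (no credentials needed)
aws s3 ls --no-sign-request s3://sg-nex-data-blow5/
# download one dataset directory (replace the path with whatever ls shows)
aws s3 sync --no-sign-request s3://sg-nex-data-blow5/dataset_name/ ./blow5/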
Hi @hasindu2008 ,
I think some of them come from the SG-NEx project, but some of them do not. What I've used is this one: HEK293T-Mettl3-KO-rep2. It is not a standard cell condition.
The behaviour of slow5tools is to be conservative and error out when a FAST5 file is corrupted. There is no option at the moment to ignore corrupted reads: first, it would require too much implementation effort; second, such corrupted files should not have been written in the first place; and third, a two-step process is better in cases like this (first run some sort of sanitiser/cleaner to get rid of the corrupted reads, then run slow5tools to convert).
When I was converting the SG-NEx datasets, what I did was open the HDF5 file in HDFView, manually locate the affected read, and delete the entry from the FAST5 file. Luckily, only around 50 such manual clean-ups had to be done.
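If you do not want to open every file interactively, a rough way to at least locate the affected files is to use the standard HDF5 command-line tools (h5ls/h5dump) and record every FAST5 that HDF5 itself cannot read cleanly, for example:
# flag every FAST5 that cannot be fully read by HDF5 (slow, since h5dump reads everything)
for f in fast5/*.fast5; do
    if ! h5dump "$f" > /dev/null 2>&1; then
        echo "$f" >> corrupted_fast5.txt    # candidates for manual cleanup in HDFView
    fi
done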
But it seems like running Guppy fast basecalling with --fast5_out removes such corrupted reads? Did it work for you? If so, run slow5tools stats <file.blow5> on the merged final BLOW5 file and see whether the total number of reads matches the total number of reads in the basecalled reads/FAST5 files.
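As a rough sketch of that check (the exact wording of the stats output may differ between versions, and merged.blow5 / sequencing_summary.txt are placeholders for your merged BLOW5 and Guppy's summary file):
# record count in the merged BLOW5
slow5tools stats merged.blow5 | grep -i "number of records"
# basecalled read count from Guppy's sequencing summary (minus the header line)
echo $(( $(wc -l < sequencing_summary.txt) - 1 ))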
Och! Thanks for your suggestions; I may do it your way.
I tried Guppy with --fast5_out, but the errors are still there. I think Guppy doesn't remove the corrupted reads; it keeps them there without basecalled information. It is my good luck that I can directly use your prepared BLOW5 files for the SG-NEx project, but for the other datasets I may have to do it myself in this painful way.
@loganylchen But given that this seems to be a common problem, I wonder if we should handle it programmatically. @Psy-Fer and @hiruna72, any thoughts on this?
By default it should fail and show the error, but we could add a flag to allow slow5tools to skip corrupted reads. If we can collect the read IDs of the corrupted reads as we go and dump them somewhere for later investigation, even better.
At the very least, people can use this method to clean up their fast5 files.
@hiruna72 can comment on how difficult it is to implement this option (in a separate branch called "bad5" for now).
Yes, a fix should be possible without much difficulty. I will post here once it is implemented. @Psy-Fer, yes, that is a good idea. A file with the corrupted read_ids and the original file paths can be written to the output.
Any ideas on what the flag could be? --bad5 ['error', 'skip', 'log'], where error is the default value and errors out on a bad fast5 record, skip just skips them and maybe dumps them to stderr, and log puts them into a file? Or something like that.
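Just to illustrate what that could look like on the command line (none of these values exist yet, so this is purely hypothetical):
slow5tools f2s fast5_dir -d blow5_dir --bad5 error   # default: stop on the first bad record
slow5tools f2s fast5_dir -d blow5_dir --bad5 skip    # skip bad records and report them on stderr
slow5tools f2s fast5_dir -d blow5_dir --bad5 log     # skip bad records and write their read_ids to a file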
@loganylchen The dataset you sent us is still downloading and will take about five more days.
If you have time, can you build from the bad_fast5 branch and run f2s on the failed dataset? This time use --bad5 1 to skip corrupted files.
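Something like this should do it, assuming the usual build steps from the slow5tools README (you may need the zlib and HDF5 development libraries installed first), with the dataset paths below as placeholders:
git clone --recursive https://github.com/hasindu2008/slow5tools
cd slow5tools
git checkout bad_fast5
./autogen.sh && ./configure && make
# then on the failed dataset, skipping corrupted files
./slow5tools f2s /path/to/fast5 -d /path/to/blow5 -p 8 --bad5 1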
Thanks.
@hiruna72 I've tried the bad_fast5 branch with --bad5 1, and it works. It goes through all the FAST5 files and completes the conversion. Do you need more testing before merging it into the main branch?
@loganylchen The issue with hiruna's method is that if a multi-fast5 file has at least one bad read, it just gets rid of the whole multi-fast5 file, which is not optimal.
If you are unhappy with losing those reads (I am), I also recently wrote a script (which is very inefficient, though) for handling this. Two scripts are involved:
- https://github.com/hasindu2008/slow5tools/blob/master/scripts/mixed-multi-fast5-to-blow5.sh
- attached below (rename to .sh extension)
clean-crappy-single-fast5-aggressive2.sh.txt
Now on the input multi-fast5 directory (20190805_Sho_M3-1/ in this case):
./mixed-multi-fast5-to-blow5.sh 20190805_Sho_M3-1/ #run this and expect this to fail
rm -r tmp_blow5/ tmp_single_fast5/ #remove temporary files created which we do not need
mv tmp_fast5 fast5 #contains the converted and classified single FAST5 based on runid
./clean-crappy-single-fast5-aggressive2.sh #a nasty script that moves the corrupted single fast5 files to a quarantine directory and runs f2s on the rest
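To get a sense of how many reads you actually lose this way, you can simply count the quarantined files afterwards (the quarantine/ directory name below is a placeholder for whatever the script creates):
find quarantine/ -name '*.fast5' | wc -l   # corrupted single FAST5s set aside by the cleaner
find fast5/ -name '*.fast5' | wc -l        # single FAST5s that went into f2s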
@hiruna72, what happens if we convert to a single FAST5 and run your thing - will it remove the resultant BLOW5 file belonging to the whole process?
No, only the bad fast5(s) will be skipped. The output BLOW5 of the process will not be deleted.
Thanks @hasindu2008, I have not checked how many reads were skipped during the conversion. If it is not too many, I think it will be fine; otherwise, I will keep your scripts in case I need them someday.
And thanks @hiruna72. You are really nice guys.
You are welcome. I will close the issue. Feel free to reopen if you want anything more.