hasindu2008/slow5tools

Conversion back to fast5 is broken in some cases

mattloose opened this issue · 11 comments

Using slow5tools v0.3.0

The input file had an unexpected field (end_reason) which triggered a warning on conversion to blow5.

Trying to recreate the fast5 from the blow5 file resulted in:

[list_all_items] Looking for '*low5' files in test_blow5/
[s2f_main] 1 files found - took 0.001s
[s2f_iop] 1 proceses will be used
[s2f_child_worker] Converting test_blow5//FAL37440_pass_c4fa58d7_179.blow5 to fast5
[slow5_get_aux_enum_labels::ERROR] No enum auxiliary type exists. At src/slow5.c:1181
[slow5_get_aux_enum_labels::ERROR] Exiting on error. At src/slow5.c:1181

If you can provide an email I can send an example file.

Data were generated in May 2020 using GridION - 19.12.6 - Guppy Version was 3.2.10

@mattloose I sent my email to you on Twitter chat.
This end_reason field has been an absolute pain. It first appeared as a string data type, then at some point it changed to an ENUM. How many variants there were in between, even Nanopore probably does not know :D
Fun fact: Did you know that the file version in FAST5 could be either a string, or an INT or even a floating-point number? Soon it could be a complex number :D

Looks like this file contains that data as an enum.

This enum type is already handled. Could you double-check which slow5tools version you are using by running slow5tools --version?

I tried the file you sent on my end, and it seems to convert back and forth:

hasindu@hasindu-xps:/mnt/c/Users/hasindu/Desktop/test_blow5$ ~/slow5tools-v0.3.0/slow5tools f2s orig/ -d blow5
[list_all_items] Looking for '*.fast5' files in orig/
[f2s_main] 1 fast5 files found - took 0.001s
[f2s_iop] 1 proceses will be used.
[search_and_warn::WARNING] slow5tools-v0.3.0: The attribute 'pore_type' is empty and will be stored in the SLOW5 header.
[search_and_warn::WARNING] slow5tools-v0.3.0: The attribute 'pore_type' is empty and will be stored in the SLOW5 header. This warning is suppressed now onwards.
[f2s_child_worker::INFO] Summary - total fast5: 1, bad fast5: 0

[f2s_main] Converting 1 fast5 files took 3.024s

[main] cmd: /home/hasindu/slow5tools-v0.3.0/slow5tools f2s orig/ -d blow5
[main] real time = 3.027 sec | CPU time = 3.047 sec | peak RAM = 0.041 GB
hasindu@hasindu-xps:/mnt/c/Users/hasindu/Desktop/test_blow5$ ~/slow5tools-v0.3.0/slow5tools s2f blow5 -d fast5_back
[list_all_items] Looking for '*low5' files in blow5
[s2f_main] 1 files found - took 0.001s
[s2f_iop] 1 proceses will be used
[s2f_child_worker] Converting blow5/FAL37440_pass_c4fa58d7_179.blow5 to fast5
[s2f_main] Converting 1 s/blow5 files took 3.419s

[main] cmd: /home/hasindu/slow5tools-v0.3.0/slow5tools s2f blow5 -d fast5_back
[main] real time = 3.424 sec | CPU time = 3.438 sec | peak RAM = 0.033 GB
hasindu@hasindu-xps:/mnt/c/Users/hasindu/Desktop/test_blow5$ ls * -lh
blow5:
total 34M
-rwxrwxrwx 1 hasindu hasindu 34M Feb  3 00:13 FAL37440_pass_c4fa58d7_179.blow5

fast5_back:
total 75M
-rwxrwxrwx 1 hasindu hasindu 75M Feb  3 00:13 FAL37440_pass_c4fa58d7_179.fast5

orig:
total 101M
-rwxrwxrwx 1 hasindu hasindu 101M Feb  3 00:11 FAL37440_pass_c4fa58d7_179.fast5

Yes - I've sent you more files...

What I have determined is that the file that breaks has been through ONT's compress_fast5 feature. In that file the end reason is an 8-bit signed int. Not an enum!

Still, we will soon update slow5tools so that it can convert such broken FAST5 files back and forth. Because compress_fast5 removed the enum labels and stored the value as a plain uint8_t, the end reason field in such files is basically lost. There are a few ways to handle this that come to mind:

  1. Drop that damaged field when converting to SLOW5
  2. Assuming the uint8_t number is the enum index and assuming ONT has not changed the enum order, we can try to reconstruct the enum label when converting to SLOW5. But this recovery could be erroneous if our assumption is wrong
  3. Store the damaged field as it is as a uint8_t and when converting back using s2f propagate the damaged field as it is.

@mattloose Which option do you prefer?
1 is the easiest. I do not like 3 much as I would like to keep the field data types consistent in SLOW5 unlike ONT.
I personally prefer 2.

I would expect slow5 to contain whatever was in the fast5 that was passed to it. I.e., I don't think slow5 conversion should be making decisions about what the data in the file are - it should simply store what was in the original file.

So I'd vote for 3. In this case the error was caused by the upstream conversion. I'd expect the input error to be carried forward. Especially as if you do 2 and get it wrong it could be really hard to track down as a user as the field would look correct when in fact it was not.
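
To make the trade-off between the three options concrete, here is a minimal sketch in plain Python. It is not slow5tools code: the record layout, the function names and both label orderings (LABELS_A, LABELS_B) are hypothetical and only serve to show how option 2 can produce a plausible-looking but wrong label if the assumed enum order differs from the one the file was written with.

```python
# HYPOTHETICAL enum orderings, for illustration only. They are not taken
# from any specific MinKNOW or ont_fast5_api release.
LABELS_A = ["unknown", "partial", "mux_change", "unblock_mux_change",
            "signal_positive", "signal_negative"]
LABELS_B = ["unknown", "partial", "mux_change", "unblock_mux_change",
            "data_service_unblock_mux_change",
            "signal_positive", "signal_negative"]


def option1_drop(record):
    """Option 1: discard the damaged end_reason field entirely."""
    return {k: v for k, v in record.items() if k != "end_reason"}


def option2_reconstruct(record, labels):
    """Option 2: treat the stored integer as an index into an assumed label list."""
    out = dict(record)
    idx = record["end_reason"]
    # An out-of-range index cannot be mapped, so mark it as unrecoverable.
    out["end_reason"] = labels[idx] if 0 <= idx < len(labels) else None
    return out


def option3_propagate(record):
    """Option 3: keep the raw integer exactly as it was found."""
    return dict(record)


# A hypothetical read whose end_reason survived only as a bare integer.
record = {"read_id": "read_0", "end_reason": 4}

print(option1_drop(record))
print(option2_reconstruct(record, LABELS_A))  # label depends on the assumed order
print(option2_reconstruct(record, LABELS_B))  # same integer, different label
print(option3_propagate(record))
```

With option 2 the same stored integer maps to two different labels depending on which ordering is assumed, and nothing in the output reveals which one was right. Option 3 keeps the value verifiable later, at the cost of a non-enum datatype for that field.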

Option 4) do a pull request on the ont conversion tool so it doesn't cause this problem.

@mattloose
The latest slow5tools dev branch has this implemented (option 3, as discussed above): if the end_reason field is corrupted, f2s will save the corrupted attribute as it is, with a warning. Also, s2f works on BLOW5 files created this way, propagating the corrupted field as it is. Could you please give it a try? To compile from the dev branch:

git clone -b dev --recursive https://github.com/hasindu2008/slow5tools 
cd slow5tools
autoreconf	# autoreconf --install for macos
./configure
make

@Psy-Fer ONT conversion tool has been potentially fixed.

Hi @hasindu2008, we only have fast5 files with a falsely converted "end_reason", but the dev branch seems to have dropped support for these "corrupted" files. We carelessly trusted the ont-fast5-api and compressed these files in place, so the original fast5 files are no longer accessible. Is there any possibility of rescuing these files? Thanks!

I have compiled the dev branch as instructed.

$ slow5tools --version
slow5tools 0.8.0-dirty

But I am still getting the following errors, and only empty slow5 files are generated.

[read_fast5::ERROR] Bad fast5: Could not iterate over the read groups in the fast5 file ./fast5_fail/FAP34515_fail_60478b74_47.fast5.
[f2s_child_worker::ERROR] Bad fast5: Could not read contents of the fast5 file './fast5_fail/FAP34515_fail_60478b74_47.fast5'.
[fast5_attribute_itr::ERROR] Attribute Raw/end_reason in ./fast5_fail/FAP34515_fail_60478b74_17.fast5 is corrupted (datatype H5T_STD_U8LE instead of expected H5T_ENUM). This is a known issue in ont_fast5_api's compress_fast5 (see https://github.com/hasindu2008/slow5tools/issues/59 and https://github.com/nanoporetech/ont_fast5_api/issues/70).
 Please get your FAST5 files fixed before SLOW5 coversion, by bugging ONT through GitHub issues.
$ ll
total 0
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_0.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_10.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_15.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_17.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_27.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_32.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_36.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_37.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_38.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_41.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_44.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_46.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_47.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_4.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_50.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_51.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_57.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_62.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_63.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_7.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_8.blow5
-rw-rw-r-- 1 zhangjy zhangjy 0 Nov 14  2022 FAP34515_fail_60478b74_9.blow5

@Kevinzjy
We ended up erroring out rather than propagating the corrupted attribute as we thought the user must be made aware of this. That way, the users can reach out to ONT [see https://github.com/nanoporetech/ont_fast5_api/issues/70] and see if there is a way to get those attributes recovered (I suggest opening an issue there and getting them to write a programme that recovers your data).

The reason why I dropped the feature that dumps the corrupted attribute to SLOW5 was that such datatype mismatches (inconsistencies) can cause future headaches, for instance when merging. If ONT does not provide a positive solution for you, we will implement an option to drop such corrupted attributes during fast5 to slow5 conversion, because if ONT is not going to provide a solution, there is no point in keeping that corrupted attribute anyway. In addition, that end_reason attribute, as pointed out by @mattloose, is not to be trusted, as MinKNOW at some point was classifying end_reason incorrectly.
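
As a rough illustration of the merging headache mentioned above, here is a minimal Python sketch. It does not reflect how slow5tools merge is implemented: the function name find_type_conflicts and the header representation (a dict mapping an auxiliary field name to a datatype string) are assumptions made for this example.

```python
def find_type_conflicts(headers):
    """Return the fields whose declared datatype differs between files.

    headers: list of dicts, one per file, mapping field name -> datatype
    string (a hypothetical, simplified stand-in for a file header).
    """
    first_seen = {}  # field -> datatype declared by the first file that has it
    conflicts = {}   # field -> set of all datatypes seen for that field
    for header in headers:
        for field, dtype in header.items():
            if field in first_seen and first_seen[field] != dtype:
                conflicts.setdefault(field, {first_seen[field]}).add(dtype)
            first_seen.setdefault(field, dtype)
    return conflicts


# One file from an intact FAST5, one from a FAST5 that lost its enum labels.
intact = {"end_reason": "enum", "median_before": "double"}
damaged = {"end_reason": "uint8_t", "median_before": "double"}

print(find_type_conflicts([intact, damaged]))
```

Here end_reason is reported as declared with two different datatypes, which is the situation a merge would have to either reject or resolve by guessing.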

Thanks for the reply, @hasindu2008. I totally agree that ONT should do something to fix this issue. I'll open a new issue there to see if they can provide a solution.