hasindu2008/slow5tools

Error during fast5 conversion: primary data field missing

maximilianmordig opened this issue · 6 comments

I am trying to convert data to slow5 and get the following error:

wget -qO- https://sra-pub-src-2.s3.amazonaws.com/ERR9127551/ecoli_r9.tar.gz.1 | tar xzv; mv r9/f5s/RefStrains210914_NK/f5s/barcode02/*.fast5 fast5_files; rm -rf r9

slow5tools f2s fast5_files/barcode02_r0barcode02b0_0.fast5 -o test1.slow5

The archive is big, so you can stop after encountering one file.

The output from slow5tools v1.1.0 is:

[f2s_main] 1 fast5 files found - took 0.000s
[f2s_main] Just before forking, peak RAM = 0.000 GB
[f2s_iop] 1 proceses will be used.
[add_aux_slow5_attribute::ERROR] Could not initialize the record attribute 'Raw/start_time' in fast5_files/barcode02_r0barcode02b0_0.fast5
[fast5_attribute_itr::ERROR] Could not add the auxiliary attribute Raw/start_time in fast5_files/barcode02_r0barcode02b0_0.fast5 to the slow5 record
[read_fast5::ERROR] Bad fast5: Could not iterate over the read groups in the fast5 file fast5_files/barcode02_r0barcode02b0_0.fast5.
[f2s_child_worker::ERROR] Could not read contents of the fast5 file 'fast5_files/barcode02_r0barcode02b0_0.fast5'.

Running with --lossless false (lossy conversion):

> slow5tools f2s fast5_files/barcode02_r0barcode02b0_0.fast5 -o slow5_files/ttt.slow5 --lossless false
[parse_arg_lossless::WARNING] You have requested lossy conversion. Generated files are only to be used for intermediate analysis and NOT for archiving. You will not be able to convert lossy files back to FAST5.
[f2s_main] 1 fast5 files found - took 0.000s
[f2s_main] Just before forking, peak RAM = 0.000 GB
[f2s_iop] 1 proceses will be used.
[read_fast5::ERROR] Bad fast5: A primary data field (read_id=ffccd4e1-b02b-4488-a5cd-ab5cf1ff8917) is missing in the fast5_files/barcode02_r0barcode02b0_0.fast5.
[f2s_child_worker::ERROR] Could not read contents of the fast5 file 'fast5_files/barcode02_r0barcode02b0_0.fast5'

I don't know which field is missing; it would be helpful to include its name in the error message.

Btw, the help seems to be incorrect:

--lossless                retain information in auxiliary fields during the conversion [true]

The help text suggests --lossless is a plain switch, but it actually takes an argument (true or false).

Hey,

We will have a look at the fast5 data, but how old are these fast5 files? And does this happen with all fast5 files, or only some?

Cheers,
James

Thanks for looking into it.
I think they are quite old. Only some fail: 30 of the 89 files convert successfully.

Hi @maximilianmordig

We have downloaded that dataset and inspected it. It is another weird form of FAST5 files (probably the 100th different type of FAST5 we have come across). This time, it identifies as a single fast5 based on the file version but is multi-fast5.

There are two ways to get this sort of file converted:

  1. Using this script, which takes a directory of FAST5 files. It is not very efficient, as it converts multi-to-single and performs many other I/O-intensive steps, but it has many built-in sanity checks to ensure that the conversion is perfect.
    I have already converted your dataset using this method and uploaded it, so you can just download it and save time.
wget -O ERR9127551_RefStrains210914_NK_barcode02.blow5 https://unsw-my.sharepoint.com/:u:/g/personal/z5136909_ad_unsw_edu_au/Ecag7i1h41xFnmOQTbUhqLkBtUtkCdoorBtQgfgxT1IWng?download=1
wget -O ERR9127551_RefStrains210914_NK_barcode02.blow5.idx https://unsw-my.sharepoint.com/:u:/g/personal/z5136909_ad_unsw_edu_au/EcOqgJWo-1VAh21NUk8ZGD4BBKw0rMTwRIDUgecCk_MDsg?download=1
  2. @hiruna72 has just implemented a patch in the branch named enforce_multi_fast5, which he will explain.

Hey @maximilianmordig,

Thank you for reporting this.

The fast5 files you have shared don't have file version or file type attributes. Hence, the program cannot detect them as multi-fast5 and treats them as single-fast5 files by default. This causes the problems you faced.

I added verbose lines to print what the program decides the file types to be (slow5tools --verbose 7 f2s ...).

Additionally, I introduced an enforce-multi-fast5 feature (slow5tools f2s --enforce-multi-fast5 ...), which forces the program to handle files as multi-fast5. Use this branch.

If a conversion fails, use the verbose output to check whether the program has failed to detect the correct file type. If so, use --enforce-multi-fast5 if appropriate. This flag is not shown in the f2s help message, as this is not a common scenario.
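The workflow above could be scripted roughly as follows. This is only a sketch under some assumptions: slow5tools is built from the enforce_multi_fast5 branch, a failed f2s run exits with a non-zero status, and retrying automatically is acceptable for your data (the advice above is to check the verbose output first). The helper name f2s_with_fallback is hypothetical and not part of slow5tools.

```shell
#!/bin/sh
# Sketch: convert one FAST5 file, retrying with --enforce-multi-fast5 when the
# default file-type detection leads to a failed conversion.
# f2s_with_fallback is a hypothetical helper, not a slow5tools command.
f2s_with_fallback() {
    src="$1"   # input FAST5 file
    dst="$2"   # output SLOW5 file

    # First attempt: default detection. --verbose 7 prints what the program
    # decides the file type to be, which helps when diagnosing a failure.
    if slow5tools --verbose 7 f2s "$src" -o "$dst"; then
        return 0
    fi

    # Precaution (an assumption): remove any partial output before retrying.
    rm -f "$dst"

    echo "retrying $src with --enforce-multi-fast5" >&2
    slow5tools f2s --enforce-multi-fast5 "$src" -o "$dst"
}

# Example usage (paths taken from the report above):
#   f2s_with_fallback fast5_files/barcode02_r0barcode02b0_0.fast5 test1.slow5
```

The function returns the exit status of the last slow5tools call, so a caller looping over a directory can collect the files that still fail after the retry and inspect their verbose output by hand.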

@maximilianmordig has this issue been resolved?

@hasindu2008 @hiruna72
Sorry for the slow reply, I wasn't feeling well recently.
I checked the tool and it works now! Thank you very much.