hasindu2008/slow5tools

Header attributes warning

mbhall88 opened this issue · 14 comments

I got a number of warning messages when running slow5tools merge on a directory of blow5 files. I was wondering if this is a big problem or whether I can ignore it?

[merge_main::WARNING] In file Calu/Calu_I_2hpi/FAO66427_4fd309c9_127.blow5, read_group 0 has a different number of header attributes than what the processed files had

When I ran slow5tools f2s on the original data, I saw the following warning a number of times:

[search_and_warn::WARNING] slow5tools-v0.3.0: The attribute 'pore_type' is empty and will be stored in the SLOW5 header.
[search_and_warn::WARNING] slow5tools-v0.3.0: The attribute 'pore_type' is empty and will be stored in the SLOW5 header. This warning is suppressed now onwards.

Are these two warnings related?

Hi @mbhall88
[search_and_warn::WARNING] slow5tools-v0.3.0: The attribute 'pore_type' is empty and will be stored in the SLOW5 header - This can be safely ignored. These new FAST5 files have an empty per-read attribute called pore_type. We keep this warning so that, if the attribute ever becomes non-empty, we can step in and see what it contains.

[search_and_warn::WARNING] slow5tools-v0.3.0: The attribute 'pore_type' is empty and will be stored in the SLOW5 header.
This is something we had better look at. Could you please grep the header of the blow5 with 'slow5tools view merged.bam | head -1000 | grep -E "^#|^@"'?

Ah which BAM file are you referring to? I assume you mean blow5?

Here's the output from one of the files with the warning

#slow5_version  0.2.0
#num_read_groups        1
@asic_id        4043209218
@asic_id_eeprom 5331776
@asic_temp      23.036455
@asic_version   IA02D
@auto_update    0
@auto_update_source     https://mirror.oxfordnanoportal.com/software/MinKNOW/
@barcoding_enabled      0
@bream_is_standard      0
@configuration_version  1.0.9
@device_id      MN29867
@device_type    minion
@distribution_status    stable
@distribution_version   19.12.5
@exp_script_name        sequencing/sequencing_MIN106_MIN107_RNA:FLO-MIN106:SQK-RNA002:False
@exp_script_purpose     sequencing_run
@exp_start_time 2020-09-18T07:12:04Z
@experiment_duration_set        4320
@experiment_type        rna
@file_type      multi-read
@file_version   2.2
@flow_cell_id   FAO66997
@flow_cell_product_code FLO-MIN106
@guppy_version  3.2.10+d9445b2
@heatsink_temp  31.046875
@hostname       lachlan-MS-7B51
@installation_type      nc
@local_basecalling      0
@local_firmware_file    1
@operating_system       ubuntu 18.04
@package        bream4
@package_version        4.3.16
@pore_type      not_set
@protocol_group_id      20200918_Caco_C_2hpi_3
@protocol_run_id        5e3f9c45-1ac3-4e5c-ad7f-e9c2eddfcaa4
@protocols_version      4.3.16
@run_id 653d240289f2bbe23237cc9276a08ec1cd80b13e
@sample_frequency       3012
@sample_id      20200918_Caco_C_2hpi_3
@sequencing_kit sqk-rna002
@usb_config     MinION_fx3_1.1.1_ONT#MinION_fpga_1.1.0#bulk#Auto
@version        3.6.5
#char*  uint32_t        double  double  double  double  uint64_t        int16_t*        uint64_t        int32_t uint8_t double  enum{unknown,partial,mux_change,unblock_mux_change,signal_positive,signal_negative}   char*
#read_id        read_group      digitisation    offset  range   sampling_rate   len_raw_signal  raw_signal      start_time      read_number     start_mux       median_before   end_reason   channel_number

Ah yeh. BLOW5. Was working with some bam files so got mixed up.

  • The pore type is "not_set". So it is completely safe to ignore that warning. As this is a recently introduced attribute we will be keeping an eye on it from our side.

  • [merge_main::WARNING] In file Calu/Calu_I_2hpi/FAO66427_4fd309c9_127.blow5, read_group 0 has a different number of header attributes than what the processed files had
    This warning means that despite the run_ids being the same, some global attributes that should have been consistent across the files are somehow found to be different.

For this, could you please provide the header of Calu/Calu_I_2hpi/FAO66427_4fd309c9_127.blow5 and of another file that triggers the merge warning together with it?
You can run slow5tools merge -o tmp.blow5 Calu/Calu_I_2hpi/FAO66427_4fd309c9_127.blow5 another_file to find such a file.

I can't seem to recreate this warning with two files.

Some of the blow5 files do indeed have a different number of header attributes (off by 1). @basecall_config_filename seems to be the attribute that varies. These fast5 files are pooled from a MinION run and a GridION run; one used guppy v3 and the other v4. So I guess maybe this is where it is coming from?

Either way, when I pick two files with a different number of header attributes and merge them I don't get that warning. It only seems to happen when I merge the entire directory of blow5 files. Annoyingly, I can't make an MRE for you, sorry.

If it is the @basecall_config_filename, we do not need to worry. It is quite likely that Nanopore is changing their filenames randomly in each release :D Could you send the header of the final merged file with the warning when you merge the entire directory? Two header examples showing a different number of header lines would also be helpful.

Either way, when I pick two files with a different number of header attributes and merge them I don't get that warning. It only seems to happen when I merge the entire directory of blow5 files.

@hiruna72 Any thoughts on this?

Thanks @mbhall88 for trying slow5tools.

Either way, when I pick two files with a different number of header attributes and merge them I don't get that warning. It only seems to happen when I merge the entire directory of blow5 files.

Could you please swap the order of the two files and try.

slow5tools merge b.slow5 a.slow5 -o o.slow5

I think then it should give the warning. If so, it is because merge first sees the file with fewer header attributes and uses it to structure the output header (for that particular run_id). Later, when another file with the same run_id is read but has

  1. a different number/set of header attributes or
  2. the same attribute names but with different values

it gives a warning.

merge assumes that fast5 files with the same run_id also have identical values for the rest of the attributes, but it still does a sanity check.
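When hunting for the files that trip this check, the attribute sets can also be compared by hand. Below is a minimal sketch, assuming each header has first been dumped to a text file with slow5tools view piped through head and grep, as earlier in this thread. The function names header_attr_names and diff_header_attrs are hypothetical helpers, not part of slow5tools, and the sketch only compares attribute names, not values.

```shell
# Hypothetical helpers (not part of slow5tools): compare the attribute
# names found in two SLOW5 header dumps.

# Print the sorted attribute names (lines starting with '@') read from stdin.
header_attr_names() {
    awk '/^@/ { print $1 }' | sort -u
}

# Print attribute names present in only one of the two header files.
# Names unique to the first file are unindented; names unique to the
# second file are indented by a tab (standard comm -3 output).
diff_header_attrs() {
    names_a=$(mktemp)
    names_b=$(mktemp)
    header_attr_names < "$1" > "$names_a"
    header_attr_names < "$2" > "$names_b"
    comm -3 "$names_a" "$names_b"
    rm -f "$names_a" "$names_b"
}

# Example usage (file names are placeholders):
#   slow5tools view a.blow5 | head -1000 | grep '^[#@]' > a.header
#   slow5tools view b.blow5 | head -1000 | grep '^[#@]' > b.header
#   diff_header_attrs a.header b.header
```

An empty result means both headers carry the same attribute names, so any remaining difference would have to be in the values.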

Could you please swap the order of the two files and try.

That didn't produce the warning either.

Could you send the header of the final merged file with the warning when you merge the entire directory?

#slow5_version  0.2.0
#num_read_groups        3
@asic_id        4043209218      416294303       416294303
@asic_id_eeprom 5331776 5048575 5048575
@asic_temp      23.036455       29.650421       28.911079
@asic_version   IA02D   IA02D   IA02D
@auto_update    0       0       0
@auto_update_source     https://mirror.oxfordnanoportal.com/software/MinKNOW/   https://mirror.oxfordnanoportal.com/software/MinKNOW/   https://mirror.oxfordnanoportal.com/software/MinKNOW/
@barcoding_enabled      0       0       0
@basecall_config_filename       .       rna_r9.4.1_70bps_hac.cfg        rna_r9.4.1_70bps_hac.cfg
@bream_is_standard      0       0       0
@configuration_version  1.0.9   4.0.13  4.0.13
@device_id      MN29867 X4      X4
@device_type    minion  gridion gridion
@distribution_status    stable  stable  stable
@distribution_version   19.12.5 20.06.9 20.06.9
@exp_script_name        sequencing/sequencing_MIN106_MIN107_RNA:FLO-MIN106:SQK-RNA002:False     sequencing/sequencing_MIN106_RNA:FLO-MIN106:SQK-RNA002  sequencing/sequencing_MIN106_RNA:FLO-MIN106:SQK-RNA002
@exp_script_purpose     sequencing_run  sequencing_run  sequencing_run
@exp_start_time 2020-09-18T07:12:04Z    2020-09-10T07:17:56Z    2020-09-11T02:06:53Z
@experiment_duration_set        4320    4320    3180
@experiment_type        rna     rna     rna
@file_type      multi-read      multi-read      multi-read
@file_version   2.2     2.2     2.2
@flow_cell_id   FAO66997        FAO67142        FAO67142
@flow_cell_product_code FLO-MIN106      FLO-MIN106      FLO-MIN106
@guppy_version  3.2.10+d9445b2  4.0.11+f1071ce  4.0.11+f1071ce
@heatsink_temp  31.046875       34.078125       34.109375
@hostname       lachlan-MS-7B51 GXB01312        GXB01312
@installation_type      nc      nc      nc
@local_basecalling      0       1       1
@local_firmware_file    1       1       1
@operating_system       ubuntu 18.04    ubuntu 16.04    ubuntu 16.04
@package        bream4  bream4  bream4
@package_version        4.3.16  6.0.7   6.0.7
@pore_type      not_set not_set not_set
@protocol_group_id      20200918_Caco_C_2hpi_3  20200910_dRNA_Caco_C_2hpi       20200910_dRNA_Caco_C_2hpi_2
@protocol_run_id        5e3f9c45-1ac3-4e5c-ad7f-e9c2eddfcaa4    c2227726-010d-4f2b-ba0d-68073d8e34fc    72aad869-3687-4ebd-8dae-a5bf4ced67ee
@protocols_version      4.3.16  6.0.7   6.0.7
@run_id 653d240289f2bbe23237cc9276a08ec1cd80b13e        fca30ccc93cd51de0a4e059239fa413cbe99412e        8a03dbd5ffdbee721ad59d266d13d88a5e509303
@sample_frequency       3012    3012    3012
@sample_id      20200918_Caco_C_2hpi_3  20200910_dRNA_Caco_C_2hpi       20200910_dRNA_Caco_C_2hpi_2
@sequencing_kit sqk-rna002      sqk-rna002      sqk-rna002
@usb_config     MinION_fx3_1.1.1_ONT#MinION_fpga_1.1.0#bulk#Auto        GridX5_fx3_1.1.3_ONT#MinION_fpga_1.1.1#bulk#Auto        GridX5_fx3_1.1.3_ONT#MinION_fpga_1.1.1#bulk#Auto
@version        3.6.5   4.0.3   4.0.3
#char*  uint32_t        double  double  double  double  uint64_t        int16_t*        enum{unknown,partial,mux_change,unblock_mux_change,signal_positive,signal_negative}     char*   double  int32_t uint8_t uint64_t
#read_id        read_group      digitisation    offset  range   sampling_rate   len_raw_signal  raw_signal      end_reason      channel_number  median_before   read_number     start_mux    start_time

I have since learned a bit more about these files and have realised merging them might not be the best idea for various reasons. So that might be the reason for all the weirdness.

Anyway if you think there are still issues I'm happy to provide more details, but I would consider this closed for now.

For our in-house datasets, I convert a whole sample into a single BLOW5 file. For instance, say we ran a sample on a PromethION flowcell and it generated three run_ids (sequencing manually stopped and started for flowcell washing etc.). I convert all three run_ids to a single blow5 file. This is safe as all the FAST5 files would have the same structure.

However, some care is needed when combining multiple samples (e.g., a sample run today with another sample run after a MinKNOW update; MinION vs GridION runs). As these FAST5 files can be inconsistent across different settings (for example, the file version field in FAST5 can be sometimes a string, sometimes an int and sometimes a float/double, which is ridiculous, to be honest), it is better to keep an eye on this when merging them, especially if you are going to archive.

But for analysis purposes, it is a different story. I mix all the weird runs together (I have merged all those NA12878 public samples into one file) so that a single file with an index can be easily fed to f5c/nanopolish. These tools do not use these strange header fields and just work fine.

Also, if you are converting for archiving purposes, please do a sanity check before deleting any fast5 by counting the number of reads in SLOW5 and FAST5. We recently came across a strange dataset that has the same FAST5 file name inside both the pass and fail directories, and how many other weird cases like that are out there is a mystery.
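One way to spot the pass/fail situation described above before converting is to look for FAST5 file names that occur more than once in the directory tree. This is a minimal sketch; find_duplicate_fast5_names is a hypothetical helper, not a slow5tools command, and it only flags names for manual inspection.

```shell
# Hypothetical helper: list FAST5 file names that occur more than once
# under a directory tree (e.g. once in pass/ and once in fail/).
find_duplicate_fast5_names() {
    find "$1" -type f -name '*.fast5' | sed 's|.*/||' | sort | uniq -d
}

# Example usage (fast5dir is a placeholder):
#   find_duplicate_fast5_names fast5dir
```

No output means every FAST5 file name under the directory is unique.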

A quick sanity check that we do in house using bash:

# estimate the number of reads in multi-fast5 (assumes 4000 reads per file)
NUMFAST5=$(find fast5dir -name '*.fast5' | wc -l)
NUM_FAST5_READS=$(echo "($NUMFAST5)*4000" | bc)
echo "$NUM_FAST5_READS"

# get the number of reads in SLOW5
NUM_SLOW5_READS=$(slow5tools stats reads.blow5 | grep "number of records" | awk '{print $NF}')
echo "$NUM_SLOW5_READS"

For multi-fast5 with 4000 reads per file, these numbers should be close (they won't be exactly the same, as the last FAST5 could have fewer than 4000 reads). An added advantage is that running slow5tools stats reads through the whole file and will complain if something is malformed.
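The "close enough" comparison can be turned into an explicit check. This is a rough sketch only: check_read_counts is a hypothetical helper, and it assumes that every FAST5 file except one holds exactly the given number of reads. A dataset with several runs or separate pass and fail directories may have more than one partially filled file and would need a looser bound.

```shell
# Hypothetical helper: succeed if the SLOW5 read count is consistent with
# the FAST5 file count, assuming at most one partially filled FAST5 file.
#   $1 = number of FAST5 files
#   $2 = number of reads reported for the SLOW5/BLOW5 file
#   $3 = reads per full FAST5 file (defaults to 4000)
check_read_counts() {
    num_files=$1
    num_slow5_reads=$2
    reads_per_file=${3:-4000}
    upper=$((num_files * reads_per_file))
    lower=$((upper - reads_per_file))
    if [ "$num_slow5_reads" -gt "$lower" ] && [ "$num_slow5_reads" -le "$upper" ]; then
        echo "OK: $num_slow5_reads reads is within ($lower, $upper]"
        return 0
    fi
    echo "MISMATCH: $num_slow5_reads reads is outside ($lower, $upper]" >&2
    return 1
}

# Example usage with the variables from the snippet above:
#   check_read_counts "$NUMFAST5" "$NUM_SLOW5_READS"
```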

If your dataset is not a closed dataset and you could give us a directory of files causing that "read_group 0 has a different number of header attributes than what the processed files had" warning, we will be very happy to look into it. Again, we appreciate your help throughout in finding all these FAST5 idiosyncrasies to improve our tools :)

If your dataset is not a closed dataset and you could give us a directory of files causing that "read_group 0 has a different number of header attributes than what the processed files had" warning, we will be very happy to look into it. Again, we appreciate your help throughout in finding all these FAST5 idiosyncrasies to improve our tools :)

Sure. What is the best way to share this with you?

Anything like Dropbox or the institute web server is fine. You can send me the link on Twitter.

I've dug into this further and tl;dr I definitely should not have been trying to merge these files, so I think slow5tools has done the right thing. Had it not given that warning I would not have known they should not have been merged.

I have DM'd a link to 3 files I was able to reproduce the warning with.

I could not reproduce the warning with 2 files, it has to be 3 (or more I guess).

Order is also important. If r1 is the last file passed, then no warning is thrown.

Hi @mbhall88
Thanks for sending through the files. The warning in this case is harmless. r1 is a separate GridION run, and r2 and r3 are from the same MinION run. These files can be merged together if you wish - it makes sense to merge if they are from the same sample. The run_id in r1 is different from that in r2 and r3, so the generated BLOW5 file would contain two read groups. basecall_config_filename is found in the first read group but not in the second, so it will be stored as a "." in the second read group. See below:

@asic_id        681462101       683821001
@asic_id_eeprom 5727747 5735452
@asic_temp      29.024300       25.714636
@asic_version   IA02D   IA02D
@auto_update    0       0
@auto_update_source     https://mirror.oxfordnanoportal.com/software/MinKNOW/   https://mirror.oxfordnanoportal.com/software/MinKNOW/
@barcoding_enabled      0       0
@basecall_config_filename       rna_r9.4.1_70bps_hac.cfg        .
...
@run_id 2e6d37a8cf48257abee9e13624a0c019fa22e037        4fd309c9a2a6bb1600aa572a317194ab8f04053f
@sample_frequency       3012    3012
@sample_id      20200915_dRNA_Calu_I_2hpi       20200920_Calu_I_2hpi_2
@sequencing_kit sqk-rna002      sqk-rna002
@usb_config     GridX5_fx3_1.1.3_ONT#MinION_fpga_1.1.1#bulk#Auto        MinION_fx3_1.1.1_ONT#MinION_fpga_1.1.0#bulk#Auto
@version        4.0.3   3.6.5

Such merged files with multiple read groups are totally fine. You can later split them into single read group BLOW5 files using slow5tools split -g.
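To see which attributes ended up as "." in a merged header, the header can be scanned column by column. A minimal sketch follows, assuming the header is tab-separated with one value column per read group after the attribute name, as in the listing above. report_missing_attrs is a hypothetical helper rather than a slow5tools command, and read groups are numbered from 0 to match the warning message.

```shell
# Hypothetical helper: read a SLOW5 header from stdin and print
# "<attribute> <read_group>" for every attribute stored as "." in that
# read group. Fields are split on tabs only, since values can contain spaces.
report_missing_attrs() {
    awk -F'\t' '/^@/ {
        for (i = 2; i <= NF; i++)
            if ($i == ".") print substr($1, 2), i - 2
    }'
}

# Example usage (merged.blow5 is a placeholder):
#   slow5tools view merged.blow5 | head -1000 | grep '^@' | report_missing_attrs
```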

Awesome. That's great to know! Thanks for the quick response.