Header attributes warning
mbhall88 opened this issue · 14 comments
I got a number of warning messages when running slow5tools merge on a directory of blow5 files. I was wondering if this is a big problem or whether I can ignore it?
[merge_main::WARNING] In file Calu/Calu_I_2hpi/FAO66427_4fd309c9_127.blow5, read_group 0 has a different number of header attributes than what the processed files had
When I ran slow5tools f2s on the original data I did see the following warning a number of times:
[search_and_warn::WARNING] slow5tools-v0.3.0: The attribute 'pore_type' is empty and will be stored in the SLOW5 header.
[search_and_warn::WARNING] slow5tools-v0.3.0: The attribute 'pore_type' is empty and will be stored in the SLOW5 header. This warning is suppressed now onwards.
Are these two warnings related?
Hi @mbhall88
[search_and_warn::WARNING] slow5tools-v0.3.0: The attribute 'pore_type' is empty and will be stored in the SLOW5 header - This can be safely ignored. These newer FAST5 files have an empty per-read attribute called pore_type. We keep this warning so that, if the attribute ever becomes non-empty, we can step in and see what it contains.
[search_and_warn::WARNING] slow5tools-v0.3.0: The attribute 'pore_type' is empty and will be stored in the SLOW5 header.
This is something we had better have a look at. Could you please grep the header of the blow5 with 'slow5tools view merged.bam | head -1000 | grep -E "^#|^@"'?
Ah which BAM file are you referring to? I assume you mean blow5?
Here's the output from one of the files with the warning
#slow5_version 0.2.0
#num_read_groups 1
@asic_id 4043209218
@asic_id_eeprom 5331776
@asic_temp 23.036455
@asic_version IA02D
@auto_update 0
@auto_update_source https://mirror.oxfordnanoportal.com/software/MinKNOW/
@barcoding_enabled 0
@bream_is_standard 0
@configuration_version 1.0.9
@device_id MN29867
@device_type minion
@distribution_status stable
@distribution_version 19.12.5
@exp_script_name sequencing/sequencing_MIN106_MIN107_RNA:FLO-MIN106:SQK-RNA002:False
@exp_script_purpose sequencing_run
@exp_start_time 2020-09-18T07:12:04Z
@experiment_duration_set 4320
@experiment_type rna
@file_type multi-read
@file_version 2.2
@flow_cell_id FAO66997
@flow_cell_product_code FLO-MIN106
@guppy_version 3.2.10+d9445b2
@heatsink_temp 31.046875
@hostname lachlan-MS-7B51
@installation_type nc
@local_basecalling 0
@local_firmware_file 1
@operating_system ubuntu 18.04
@package bream4
@package_version 4.3.16
@pore_type not_set
@protocol_group_id 20200918_Caco_C_2hpi_3
@protocol_run_id 5e3f9c45-1ac3-4e5c-ad7f-e9c2eddfcaa4
@protocols_version 4.3.16
@run_id 653d240289f2bbe23237cc9276a08ec1cd80b13e
@sample_frequency 3012
@sample_id 20200918_Caco_C_2hpi_3
@sequencing_kit sqk-rna002
@usb_config MinION_fx3_1.1.1_ONT#MinION_fpga_1.1.0#bulk#Auto
@version 3.6.5
#char* uint32_t double double double double uint64_t int16_t* uint64_t int32_t uint8_t double enum{unknown,partial,mux_change,unblock_mux_change,signal_positive,signal_negative} char*
#read_id read_group digitisation offset range sampling_rate len_raw_signal raw_signal start_time read_number start_mux median_before end_reason channel_number
Ah yeh. BLOW5. Was working with some bam files so got mixed up.
- The pore type is "not_set", so it is completely safe to ignore that warning. As this is a recently introduced attribute, we will keep an eye on it from our side.
- [merge_main::WARNING] In file Calu/Calu_I_2hpi/FAO66427_4fd309c9_127.blow5, read_group 0 has a different number of header attributes than what the processed files had
  This warning means that, despite the run_ids being the same, some global attributes that should have been consistent across the files turned out to be different.
For this, could you please provide the header of Calu/Calu_I_2hpi/FAO66427_4fd309c9_127.blow5 and of another file that it triggers the merge warning with?
You can run
slow5tools merge -o tmp.blow5 Calu/Calu_I_2hpi/FAO66427_4fd309c9_127.blow5 another_file
to find such a file.
I can't seem to recreate this warning with two files.
Some of the blow5 files indeed have a different number of header attributes (off by 1). @basecall_config_filename seems to be the attribute that varies. These fast5 files are pooled from a MinION and a GridION run, and one used guppy v3 and the other v4. So I guess maybe this is where it is coming from?
Either way, when I pick two files with a different number of header attributes and merge them I don't get that warning. It only seems to happen when I merge the entire directory of blow5 files. Annoying I can't make an MRE for you sorry.
If it is the @basecall_config_filename, we do not need to worry. It is quite likely that Nanopore changes these filenames randomly in each release :D Could you send the header of the final merged file with the warning when you merge the entire directory? Two example headers showing a different number of header lines would also be helpful.
Either way, when I pick two files with a different number of header attributes and merge them I don't get that warning. It only seems to happen when I merge the entire directory of blow5 files @hiruna72 Any thoughts on this?
Thanks @mbhall88 for trying slow5tools.
Either way, when I pick two files with a different number of header attributes and merge them I don't get that warning. It only seems to happen when I merge the entire directory of blow5 files.
Could you please swap the order of the two files and try.
merge b.slow5 a.slow5 -o o.slow5
I think then it should give the warning. If so, it is because merge first sees the file with fewer header attributes and uses it to structure the output header (for that particular run_id). Later, when another file with the same run_id is read but has
- a different number/set of header attributes or
- the same attribute names but with different values
it gives a warning.
merge assumes fast5 files with the same run_id should also have the rest of the attributes identical, but still does a sanity check.
Could you please swap the order of the two files and try.
That didn't produce the warning either.
Could you send the header of the final merged file with the warning when you merge the entire directory?
#slow5_version 0.2.0
#num_read_groups 3
@asic_id 4043209218 416294303 416294303
@asic_id_eeprom 5331776 5048575 5048575
@asic_temp 23.036455 29.650421 28.911079
@asic_version IA02D IA02D IA02D
@auto_update 0 0 0
@auto_update_source https://mirror.oxfordnanoportal.com/software/MinKNOW/ https://mirror.oxfordnanoportal.com/software/MinKNOW/ https://mirror.oxfordnanoportal.com/software/MinKNOW/
@barcoding_enabled 0 0 0
@basecall_config_filename . rna_r9.4.1_70bps_hac.cfg rna_r9.4.1_70bps_hac.cfg
@bream_is_standard 0 0 0
@configuration_version 1.0.9 4.0.13 4.0.13
@device_id MN29867 X4 X4
@device_type minion gridion gridion
@distribution_status stable stable stable
@distribution_version 19.12.5 20.06.9 20.06.9
@exp_script_name sequencing/sequencing_MIN106_MIN107_RNA:FLO-MIN106:SQK-RNA002:False sequencing/sequencing_MIN106_RNA:FLO-MIN106:SQK-RNA002 sequencing/sequencing_MIN106_RNA:FLO-MIN106:SQK-RNA002
@exp_script_purpose sequencing_run sequencing_run sequencing_run
@exp_start_time 2020-09-18T07:12:04Z 2020-09-10T07:17:56Z 2020-09-11T02:06:53Z
@experiment_duration_set 4320 4320 3180
@experiment_type rna rna rna
@file_type multi-read multi-read multi-read
@file_version 2.2 2.2 2.2
@flow_cell_id FAO66997 FAO67142 FAO67142
@flow_cell_product_code FLO-MIN106 FLO-MIN106 FLO-MIN106
@guppy_version 3.2.10+d9445b2 4.0.11+f1071ce 4.0.11+f1071ce
@heatsink_temp 31.046875 34.078125 34.109375
@hostname lachlan-MS-7B51 GXB01312 GXB01312
@installation_type nc nc nc
@local_basecalling 0 1 1
@local_firmware_file 1 1 1
@operating_system ubuntu 18.04 ubuntu 16.04 ubuntu 16.04
@package bream4 bream4 bream4
@package_version 4.3.16 6.0.7 6.0.7
@pore_type not_set not_set not_set
@protocol_group_id 20200918_Caco_C_2hpi_3 20200910_dRNA_Caco_C_2hpi 20200910_dRNA_Caco_C_2hpi_2
@protocol_run_id 5e3f9c45-1ac3-4e5c-ad7f-e9c2eddfcaa4 c2227726-010d-4f2b-ba0d-68073d8e34fc 72aad869-3687-4ebd-8dae-a5bf4ced67ee
@protocols_version 4.3.16 6.0.7 6.0.7
@run_id 653d240289f2bbe23237cc9276a08ec1cd80b13e fca30ccc93cd51de0a4e059239fa413cbe99412e 8a03dbd5ffdbee721ad59d266d13d88a5e509303
@sample_frequency 3012 3012 3012
@sample_id 20200918_Caco_C_2hpi_3 20200910_dRNA_Caco_C_2hpi 20200910_dRNA_Caco_C_2hpi_2
@sequencing_kit sqk-rna002 sqk-rna002 sqk-rna002
@usb_config MinION_fx3_1.1.1_ONT#MinION_fpga_1.1.0#bulk#Auto GridX5_fx3_1.1.3_ONT#MinION_fpga_1.1.1#bulk#Auto GridX5_fx3_1.1.3_ONT#MinION_fpga_1.1.1#bulk#Auto
@version 3.6.5 4.0.3 4.0.3
#char* uint32_t double double double double uint64_t int16_t* enum{unknown,partial,mux_change,unblock_mux_change,signal_positive,signal_negative} char* double int32_t uint8_t uint64_t
#read_id read_group digitisation offset range sampling_rate len_raw_signal raw_signal end_reason channel_number median_before read_number start_mux start_time
I have since learned a bit more about these files and have realised merging them might not be the best idea for various reasons. So that might be the reason for all the weirdness.
Anyway if you think there are still issues I'm happy to provide more details, but I would consider this closed for now.
For our in-house datasets, I convert a whole sample into a single BLOW5 file. For instance, say we ran a sample on a PromethION flowcell and it generated three run_ids (sequencing manually stopped and started for flowcell washing etc). All three of those run_ids I convert to a single blow5 file. This is safe as all the FAST5 files would be of the same structure.
However, some care is needed when combining multiple samples (e.g., a sample run today with another sample run after a MinKNOW update; MinION vs GridION runs). As these FAST5 files can be inconsistent across different settings (for example, the file version field in FAST5 can be sometimes a string, sometimes an int and sometimes a float/double, which is ridiculous, to be honest), it is better to keep an eye out when merging those, especially if you are going to archive.
But for analysis purposes, it is a different story. I mix all the weird runs together (I have merged all those NA12878 public samples into one file) so that a single file with an index can be easily fed to f5c/nanopolish. These tools do not use these strange header fields and just work fine.
Also, if you are converting for archiving purposes, please do a sanity check before deleting any FAST5 files by counting the number of reads in SLOW5 and FAST5. We recently came across a strange dataset that has the same FAST5 file name inside the pass and fail directories, and how many other weird cases like that are out there is a mystery.
A quick sanity check that we do in house using bash:
#estimate number of reads in multi-fast5
NUMFAST5=$(find fast5dir -name '*.fast5' | wc -l)
NUM_FAST5_READS=$(echo "($NUMFAST5)*4000" | bc)
echo $NUM_FAST5_READS
#get slow5reads
NUM_SLOW5_READS=$(slow5tools stats reads.blow5 | grep "number of records" | awk '{print $NF}')
echo $NUM_SLOW5_READS
For multi-fast5 with 4000 reads per file, these numbers should be close (they won't be exactly the same, as the last FAST5 could have fewer than 4000 reads). An added advantage is that running slow5tools stats will read through the whole file and will complain if something is malformed.
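If you want to turn "close" into a pass/fail check, a rough sketch is below. It assumes that only one FAST5 file (the last one) can be partially filled, so the SLOW5 count should fall between (N-1)*4000+1 and N*4000 for N files. That assumption will not hold for datasets pooled from several runs, and the function name reads_plausible is made up for illustration.

```shell
# Succeed if the SLOW5 read count fits N multi-FAST5 files where every file
# holds reads_per_file reads except possibly the last one.
# Assumption: at most one partially filled file in the whole dataset.
# Usage: reads_plausible NUM_FAST5_FILES NUM_SLOW5_READS [READS_PER_FILE]
reads_plausible() {
    num_fast5=$1
    num_slow5=$2
    reads_per_file=${3:-4000}
    upper=$((num_fast5 * reads_per_file))
    lower=$(((num_fast5 - 1) * reads_per_file + 1))
    [ "$num_slow5" -ge "$lower" ] && [ "$num_slow5" -le "$upper" ]
}
```

With the variables from the snippet above, this would be called as reads_plausible "$NUMFAST5" "$NUM_SLOW5_READS" and the exit status checked before deleting anything.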
If your dataset is not a closed dataset and you could give us a directory of files causing the "read_group 0 has a different number of header attributes than what the processed files had" warning, we will be very happy to look into it. Again, we appreciate your help throughout in finding all these FAST5 idiosyncrasies to improve our tools :)
If your dataset is not a closed dataset and you could give us a directory of files causing the "read_group 0 has a different number of header attributes than what the processed files had" warning, we will be very happy to look into it. Again, we appreciate your help throughout in finding all these FAST5 idiosyncrasies to improve our tools :)
Sure. What is the best way to share this with you?
Anything like Dropbox or the institute web server is fine. You can send me the link on Twitter.
I've dug into this further and tl;dr I definitely should not have been trying to merge these files, so I think slow5tools has done the right thing. Had it not given that warning I would not have known they should not have been merged.
I have DM'd a link to 3 files I was able to reproduce the warning with.
I could not reproduce the warning with 2 files, it has to be 3 (or more I guess).
Order is also important. If r1 is the last file passed, then no warning is thrown.
Hi @mbhall88
Thanks for sending through the files. The warning in this case is harmless. r1 is a separate gridion run and r2,r3 are from the same MinION run. These files can be merged together if you wish - it makes sense to merge if they are from the same sample. The run id in r1 is different from that in r2,r3, so the generated BLOW5 file would contain two read groups. basecall_config_filename is found in the first read group, but not found in the second read group, so it will be stored as a "." in the second read group. See below:
@asic_id 681462101 683821001
@asic_id_eeprom 5727747 5735452
@asic_temp 29.024300 25.714636
@asic_version IA02D IA02D
@auto_update 0 0
@auto_update_source https://mirror.oxfordnanoportal.com/software/MinKNOW/ https://mirror.oxfordnanoportal.com/software/MinKNOW/
@barcoding_enabled 0 0
@basecall_config_filename rna_r9.4.1_70bps_hac.cfg .
...
@run_id 2e6d37a8cf48257abee9e13624a0c019fa22e037 4fd309c9a2a6bb1600aa572a317194ab8f04053f
@sample_frequency 3012 3012
@sample_id 20200915_dRNA_Calu_I_2hpi 20200920_Calu_I_2hpi_2
@sequencing_kit sqk-rna002 sqk-rna002
@usb_config GridX5_fx3_1.1.3_ONT#MinION_fpga_1.1.1#bulk#Auto MinION_fx3_1.1.1_ONT#MinION_fpga_1.1.0#bulk#Auto
@version 4.0.3 3.6.5
Such merged files with multiple read groups are totally fine. You can later split them into single read group BLOW5 files using slow5tools split -g.
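To list which attributes ended up as "." in a merged header, and in which read group, a small awk filter along these lines could be used. It is a sketch that assumes the header has been dumped to a text file with tab-separated values, one column per read group. The function name missing_attrs is made up for illustration, and read groups are numbered from 0 to match the "read_group 0" wording in the warning.

```shell
# For each '@' header line, print "<attribute>\t<read_group_index>" for every
# read group column whose value is the "." placeholder.
# Assumption: columns are tab-separated, one value column per read group.
missing_attrs() {
    awk -F'\t' '
        /^@/ {
            for (i = 2; i <= NF; i++)
                if ($i == ".") print substr($1, 2) "\t" (i - 2)
        }
    ' "$1"
}
```

For the two read group header shown above, this would report basecall_config_filename for read group 1.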
Awesome. That's great to know! Thanks for the quick response.