hasindu2008/slow5tools

slow5tools degrade (1.3.0) does not detect ULK kit?


For the following s/blow5 header, made with blue-crab (0.1.2), slow5tools degrade (1.3.0) does not seem to recognize the ULK kit.

#slow5_version  0.2.0
#num_read_groups        1
@acquisition_id ca82937006c473b34e065122cf6a8ed73c55ce18
@acquisition_start_time 2024-06-26 09:25:49.033000+00:00
@adc_max        2047
@adc_min        0
@asic_id        FFFFFC0FE73734C0
@asic_id_eeprom FFFFFC0FE73734C0
@asic_temp      28.228447
@asic_version   Unknown
@barcoding_enabled      0
@basecall_config_filename       dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_hac_prom.cfg
@configuration_version  5.9.18
@data_source    real_device
@device_id      A
@device_type    p2_solo
@distribution_status    stable
@distribution_version   24.02.16
@exp_script_name        sequencing/sequencing_PRO114_DNA_e8_2_400K_long_read:FLO-PRO114M:SQK-ULK114:400
@exp_script_purpose     sequencing_run
@exp_start_time 2024-06-26T11:25:49.033544+02:00
@experiment_name        Blood-WGS_ONT_24062024
@experiment_type        genomic_dna
@flow_cell_id   PAU99561
@flow_cell_product_code FLO-PRO114M
@fpga_board_id  0018f5206e51685c
@fpga_firmware_version  2.1.0
@guppy_version  7.3.11+0112dde09
@heatsink_temp  34.045727
@host_product_code      GRD-MK1
@host_product_serial_number     GXB04189
@hostname       GXB04189
@installation_type      nc
@is_simulated   0
@local_basecalling      1
@operating_system       ubuntu 20.04
@package        bream4
@package_version        7.9.8
@protocol_group_id      Blood-WGS_ONT_24062024
@protocol_name  sequencing/sequencing_PRO114_DNA_e8_2_400K_long_read:FLO-PRO114M:SQK-ULK114:400
@protocol_run_id        d2c3e09e-da67-4bba-aecf-0c004874a607
@protocol_start_time    2024-06-26T11:24:03.627544+02:00
@protocols_version      7.9.8
@run_id ca82937006c473b34e065122cf6a8ed73c55ce18
@sample_frequency       5000
@sample_id      Blood-WGS_L3_26062024
@sample_rate    5000
@selected_speed_bases_per_second        400
@sequencer_hardware_revision    HW-30
@sequencer_position     P2S-00581-A
@sequencer_position_type        PromethION
@sequencer_product_code PRO-SEQ002
@sequencer_serial_number        P2S-00581
@sequencing_kit sqk-ulk114
@software       MinKNOW 24.02.16 (Bream 7.9.8, Core 5.9.12, Dorado 7.3.11+0112dde09)
@system_name    GXB04189
@system_type    GridION Mk1
@usb_config     fx3_0.0.0#fpga_0.0.0#unknown#unknown
@usb_firmware_version   2.5.1
@version        5.9.12
~/bin/slow5tools-v1.3.0/slow5tools degrade -s ex-zd -c zstd PAU99561_d2c3e09e_ca829370_21.blow5 -o PAU99561_d2c3e09e_ca829370_21.3.blow5

[degrade_main::WARNING] This tool performs lossy compression which is an irreversible operation. Just making sure it is intended. 
[slow5_hdr_get_dataset] Not detected: MinION DNA lsk114 5kHz
[slow5_hdr_get_dataset] Not detected: PromethION DNA lsk109 4kHz
[slow5_hdr_get_dataset] Not detected: PromethION DNA lsk114 4kHz
[slow5_hdr_get_dataset] Not detected: PromethION DNA lsk114 5kHz
[slow5_hdr_get_dataset] Not detected: PromethION RNA rna002 3kHz
[slow5_hdr_get_dataset] Not detected: PromethION RNA rna004 4kHz
[slow5_hdr_get_dataset::ERROR] No suitable bits suggestion
[degrade_main::ERROR] Use option -b to manually specify
~/bin/slow5tools-v1.3.0/slow5tools degrade -s ex-zd -c zstd PAU99561_d2c3e09e_ca829370_21.blow5 -o PAU99561_d2c3e09e_ca829370_21.3.blow5 -b4
[degrade_main::WARNING] This tool performs lossy compression which is an irreversible operation. Just making sure it is intended. 
[slow5_encode_signal_press::WARNING] Signal compression method ex-zd is new. While it is stable, just keep an eye. At src/slow5_press.c:116

[main] cmd: /home/jelber43/bin/slow5tools-v1.3.0/slow5tools degrade -s ex-zd -c zstd PAU99561_d2c3e09e_ca829370_21.blow5 -o PAU99561_d2c3e09e_ca829370_21.3.blow5 -b4
[main] real time = 40.577 sec | CPU time = 117.731 sec | peak RAM = 3.700 GB

I guess it would be fine if the ULK part could be parsed. Alternatively, could the tool show the user which -b values to use for different datasets?

Hello, we are parsing the ulk part properly, but degrade checks whether the kit matches one of those we have exhaustively tested. As this is lossy compression, we are being very pedantic to prevent users from inadvertently affecting their data. Kits will be added as we come across them and test them. I have not had access to GridION sqk-ulk114 data, but it is very likely that the suitable -b would be 3. Is this a publicly available dataset?
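
To make the check described here concrete, below is a minimal Python sketch. It is not the slow5tools implementation (which is written in C): the function names `parse_header` and `dataset_label`, and the choice of header fields combined into a label, are assumptions made purely for illustration. The tested labels are the six visible in the "Not detected" log lines above; the real list may be longer.

```python
# Hypothetical sketch of an allowlist check; not the slow5tools C code.
# Labels copied from the "Not detected" lines in the log above.
TESTED_DATASETS = {
    "MinION DNA lsk114 5kHz",
    "PromethION DNA lsk109 4kHz",
    "PromethION DNA lsk114 4kHz",
    "PromethION DNA lsk114 5kHz",
    "PromethION RNA rna002 3kHz",
    "PromethION RNA rna004 4kHz",
}


def parse_header(text):
    """Collect '@key<whitespace>value' header attributes into a dict."""
    attrs = {}
    for line in text.splitlines():
        if line.startswith("@"):
            parts = line[1:].split(None, 1)
            attrs[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return attrs


def dataset_label(attrs):
    """Build a label in the style of the log lines (field choice is assumed)."""
    device = attrs["sequencer_position_type"]
    molecule = "RNA" if "rna" in attrs["experiment_type"].lower() else "DNA"
    kit = attrs["sequencing_kit"].lower()
    if kit.startswith("sqk-"):
        kit = kit[len("sqk-"):]
    khz = int(attrs["sample_frequency"]) // 1000
    return f"{device} {molecule} {kit} {khz}kHz"


# Four fields taken from the header posted in this issue.
header = "\n".join([
    "@experiment_type\tgenomic_dna",
    "@sample_frequency\t5000",
    "@sequencer_position_type\tPromethION",
    "@sequencing_kit\tsqk-ulk114",
])

label = dataset_label(parse_header(header))
print(label, label in TESTED_DATASETS)
```

Under these assumptions the kit is parsed correctly as ulk114, yet the resulting label is not among the tested ones, which is why a manual -b is requested.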

As per the Twitter conversation (https://x.com/jpelbers/status/1842484817885073502), here is a Dropbox link to ~30x average coverage ONT ULK reads for HG002 chr22 (based on alignment to hg38 no alts). The reads come from HG002 cells: DNA was extracted following a BioNano DNA extraction protocol, underwent ONT ULK library preparation, and was then sequenced on an ONT PromethION P2 solo device with an r10.4.1 flowcell connected to an ONT GridION for data acquisition. Provided is an ex-zd, zstd blow5 file that you can access with

wget 'https://www.dropbox.com/scl/fi/8s0p4ttpuy1amiuulzu3v/WGS_HG002_Bionano_recover_13022024.chr22.readids.blow5?rlkey=395acerl9ewgyqkafi7g15ipe&st=giubcawn' -O WGS_HG002_Bionano_recover_13022024.chr22.blow5

on a computer with wget.

Best,
Jean Elbers

*NOTE that the blow5 file on Dropbox does not match the header above in this GitHub issue, as I realized those squiggles did not belong to HG002.

Thanks, we will have a look at this as soon as possible.

OK, @KavinduJayas did the tests and 3 bits seems to be the suitable number of bits to remove.

Identity scores:
plot_WGS_HG002_Bionano_recover_13022024_rounded_1_vs_original_sup.pdf
plot_WGS_HG002_Bionano_recover_13022024_rounded_2_vs_original_sup.pdf
plot_WGS_HG002_Bionano_recover_13022024_rounded_3_vs_original_sup.pdf
plot_WGS_HG002_Bionano_recover_13022024_rounded_4_vs_original_sup.pdf

Methylation correlation:
WGS_HG002_Bionano_recover_13022024.chr22_rounded_1_bi_vs_remora.pdf
WGS_HG002_Bionano_recover_13022024.chr22_rounded_2_bi_vs_remora.pdf
WGS_HG002_Bionano_recover_13022024.chr22_rounded_3_bi_vs_remora.pdf
WGS_HG002_Bionano_recover_13022024.chr22_rounded_4_bi_vs_remora.pdf
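
For readers unfamiliar with what removing bits means here, the following is a minimal Python sketch that discards the b least significant bits of a raw sample by rounding it to the nearest multiple of 2^b. The function name `degrade_sample` is hypothetical, and the exact rounding used inside slow5tools degrade may differ; this only illustrates the general idea behind the "rounded_1" to "rounded_4" variants compared above.

```python
def degrade_sample(sample, bits):
    """Round an integer sample to the nearest multiple of 2**bits.

    Illustrative only; slow5tools degrade may round differently.
    """
    half_step = (1 << bits) >> 1  # half of 2**bits, used for round-to-nearest
    return ((sample + half_step) >> bits) << bits


raw = [1000, 1003, 1004, 1011]
print([degrade_sample(s, 3) for s in raw])  # [1000, 1000, 1008, 1008]
```

With 3 bits removed, every stored value is a multiple of 8, so the signal carries less entropy and compresses better, at the cost of precision that cannot be recovered.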

@sashajenner could you please implement a profile for this data in the dev branch for degrade? The relevant header data is as follows:

#slow5_version  0.2.0
#num_read_groups        1
@acquisition_id 014da3cd8f6521012f0430299be6ee90c8be10c8
@acquisition_start_time 2024-02-13 11:11:37.722000+00:00
@adc_max        2047
@adc_min        0
@asic_id        0004A30B01138266
@asic_id_eeprom 0004A30B01138266
@asic_temp      27.578566
@asic_version   Unknown
@barcoding_enabled      0
@basecall_config_filename       dna_r10.4.1_e8.2_400bps_5khz_hac_prom.cfg
@configuration_version  5.8.6
@data_source    real_device
@device_id      B
@device_type    p2_solo
@distribution_status    stable
@distribution_version   23.11.7
@exp_script_name        sequencing/sequencing_PRO114_DNA_e8_2_400K:FLO-PRO114M:SQK-ULK114:400
@exp_script_purpose     sequencing_run
@exp_start_time 2024-02-13T11:11:37.722531+00:00
@experiment_name        WGS_HG002_Bionano_recover_13022024
@experiment_type        genomic_dna
@flow_cell_id   PAU64142
@flow_cell_product_code FLO-PRO114M
@fpga_board_id  0018f5206e51685c
@fpga_firmware_version  2.1.0
@guppy_version  7.2.13+fba8e8925
@heatsink_temp  33.988201
@host_product_code      GRD-MK1
@host_product_serial_number     GXB04189
@hostname       GXB04189
@installation_type      nc
@is_simulated   0
@local_basecalling      1
@operating_system       ubuntu 20.04
@package        bream4
@package_version        7.8.2
@protocol_group_id      WGS_HG002_Bionano_recover_13022024
@protocol_name  sequencing/sequencing_PRO114_DNA_e8_2_400K:FLO-PRO114M:SQK-ULK114:400
@protocol_run_id        fd789ccc-282f-4e00-8532-909719d345b8
@protocol_start_time    2024-02-13T11:09:55.746097+00:00
@protocols_version      7.8.2
@run_id 014da3cd8f6521012f0430299be6ee90c8be10c8
@sample_frequency       5000
@sample_id      WGS_HG002_Bionano_recover
@sample_rate    5000
@selected_speed_bases_per_second        400
@sequencer_hardware_revision    HW-30
@sequencer_position     P2S-00581-B
@sequencer_position_type        PromethION
@sequencer_product_code PRO-SEQ002
@sequencer_serial_number        P2S-00581
@sequencing_kit sqk-ulk114
@software       MinKNOW 23.11.7 (Bream 7.8.2, Core 5.8.6, Dorado 7.2.13+fba8e8925)
@system_name    GXB04189
@system_type    GridION Mk1
@usb_config     fx3_0.0.0#fpga_0.0.0#unknown#unknown
@version        5.8.6

Since this kit can be used on device types other than the PromethION 2 Solo, should we ignore the device_type header field? Or does the device affect the ideal number of bits to remove?
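
To frame the two options, here is a small Python sketch in which a profile either pins the device or treats it as a wildcard. Everything in it is hypothetical: the `Profile` class, its field names and the "some_other_device" value are assumptions for illustration, not the degrade code. The 3-bit value is the one reported from the tests above, which were run on P2 Solo data only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    """Hypothetical degrade profile; device_type=None means any device."""
    kit: str
    sample_rate: int
    bits: int
    device_type: Optional[str] = None

    def matches(self, attrs):
        if attrs.get("sequencing_kit") != self.kit:
            return False
        if int(attrs.get("sample_frequency", 0)) != self.sample_rate:
            return False
        return self.device_type is None or attrs.get("device_type") == self.device_type


strict = Profile("sqk-ulk114", 5000, bits=3, device_type="p2_solo")
loose = Profile("sqk-ulk114", 5000, bits=3)  # ignores device_type

# Fields from the header above, plus a made-up device for comparison.
p2_solo = {"sequencing_kit": "sqk-ulk114", "sample_frequency": "5000", "device_type": "p2_solo"}
other = dict(p2_solo, device_type="some_other_device")  # hypothetical value

print(strict.matches(p2_solo), strict.matches(other))  # True False
print(loose.matches(p2_solo), loose.matches(other))    # True True
```

The strict form stays within what has been tested; the loose form is only safe if the device turns out not to affect the ideal number of bits.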