Dorado 0.8.0 lost lots of reads after rebasecalling
Closed this issue · 5 comments
Issue Report
Please describe the issue:
The output fastq should contain over 500M bases, which it did with Dorado 0.6.0. However, with Dorado 0.8.0, the largest fastq file had only 2M bases.
Steps to reproduce the issue:
$dorado basecaller --recursive $model $input_scratch --modified-bases 6mA --kit-name SQK-NBD114-24 > $output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam
$dorado demux --output-dir $output_scratch_demultiplex --kit-name SQK-NBD114-24 $output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam
Run environment:
- Dorado version: 0.8.0
- Dorado command:
$dorado basecaller --recursive $model $input_scratch --modified-bases 6mA --kit-name SQK-NBD114-24 > $output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam
$dorado demux --output-dir $output_scratch_demultiplex --kit-name SQK-NBD114-24 $output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam
- Operating system: Linux
- Hardware (CPUs, Memory, GPUs): NVIDIA H100 PCIe
- Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
- Source data location (on device or networked drive - NFS, etc.): on device
- Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): SQK-NBD114-24
Logs
[2024-09-28 07:11:16.955] [info] Running: "basecaller" "--recursive" "/scratch/project/genoepic_rumen/dorado-0.8.0-linux-x64/data/dna_r10.4.1_e8.2_400bps_sup@v5.0.0" "/scratch/project/genoepic_rumen/ecoli_dna_methyl/pod5" "--modified-bases" "6mA" "--kit-name" "SQK-NBD114-24"
[2024-09-28 07:11:17.807] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar SSL_CERT_FILE to specify the location manually.
[2024-09-28 07:11:17.813] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0_6mA@v2 with httplib
[2024-09-28 07:11:17.877] [error] Failed to download dna_r10.4.1_e8.2_400bps_sup@v5.0.0_6mA@v2: SSL server verification failed
[2024-09-28 07:11:17.877] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0_6mA@v2 with curl
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
^M 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0^M 23 18.4M 23 4375k 0 0 71.6M 0 --:--:-- --:--:-- --:--:-- 71.2M^M100 18.4M 100 18.4M 0 0 170M 0 --:--:-- --:--:-- --:--:-- 169M
[2024-09-28 07:11:18.226] [info] > Creating basecall pipeline
[2024-09-28 07:12:07.562] [warning] Unable to find chunk benchmarks for GPU "NVIDIA H100 PCIe", model /scratch/project/genoepic_rumen/dorado-0.8.0-linux-x64/data/dna_r10.4.1_e8.2_400bps_sup@v5.0.0 and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-09-28 07:12:07.562] [warning] Unable to find chunk benchmarks for GPU "NVIDIA H100 PCIe", model /scratch/project/genoepic_rumen/dorado-0.8.0-linux-x64/data/dna_r10.4.1_e8.2_400bps_sup@v5.0.0 and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-09-28 07:12:08.922] [info] cuda:0 using chunk size 12288, batch size 96
[2024-09-28 07:12:08.922] [info] cuda:1 using chunk size 12288, batch size 96
[2024-09-28 07:12:09.008] [info] cuda:0 using chunk size 6144, batch size 96
[2024-09-28 07:12:09.013] [info] cuda:1 using chunk size 6144, batch size 96
terminate called after throwing an instance of 'std::runtime_error'
what(): Empty sequence and qstring provided for read id 39d5fcd5-ac11-48f5-acea-169a2736a9f0
/var/spool/slurmd/job10990506/slurm_script: line 33: 2837796 Aborted (core dumped) $dorado basecaller --recursive $model $input_scratch --modified-bases 6mA --kit-name SQK-NBD114-24 > $output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam
[2024-09-28 07:42:06.080] [info] Running: "demux" "--output-dir" "/scratch/project/genoepic_rumen/ecoli_dna_methyl_dorado_0_8/demultiplex_sup" "--kit-name" "SQK-NBD114-24" "/scratch/project/genoepic_rumen/ecoli_dna_methyl_dorado_0_8/bam_sup/ecoli_dna_exp_sta_6mA_sup.bam"
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[2024-09-28 07:42:06.119] [info] num input files: 1
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[2024-09-28 07:42:06.382] [info] > starting barcode demuxing
Hi @SimonChen1997,
It looks like the original base calling job crashed. This is why you have very little output.
terminate called after throwing an instance of 'std::runtime_error'
what(): Empty sequence and qstring provided for read id 39d5fcd5-ac11-48f5-acea-169a2736a9f0
It looks like you have a problematic read.
The demux job is also telling you there's something wrong with the basecalling output:
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
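That warning can be checked for directly before demuxing. A complete BAM file ends with a fixed 28-byte BGZF end-of-file block (defined in the SAM/BAM specification), so a missing block means the writer was interrupted. The following is a minimal sketch, not part of Dorado; the function name `bam_has_eof_marker` is made up for illustration. `samtools quickcheck` performs a similar test if samtools is available.

```shell
#!/usr/bin/env bash
# Hex encoding of the 28-byte BGZF end-of-file block from the SAM/BAM spec.
BGZF_EOF_HEX="1f8b08040000000000ff0600424302001b0003000000000000000000"

# Hypothetical helper: returns 0 if the file ends with the BGZF EOF block,
# 1 if it does not (likely truncated), 2 if the file does not exist.
bam_has_eof_marker() {
    local bam="$1" tail_hex
    [ -f "$bam" ] || return 2
    # Dump the last 28 bytes as one contiguous lowercase hex string.
    tail_hex=$(tail -c 28 "$bam" | od -An -v -tx1 | tr -d ' \n')
    [ "$tail_hex" = "$BGZF_EOF_HEX" ]
}
```

For example, `bam_has_eof_marker "$output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam" || echo "BAM looks truncated"` would have flagged the file from the crashed basecalling run.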
Best regards,
Rich
Hi,
Thanks for your reply. However, all the pod5 files can be successfully basecalled using Dorado 0.6.0.
Can I ask why this happens with 0.8.0?
Cheers,
Ziming
This is presumably a variant on #1020.
Also note: you are performing barcoding twice. You only need to specify --kit-name to either dorado basecaller or to dorado demux. Your current command will lead to many unclassified reads, as the barcodes will be trimmed after the first step. Since you are seeing this error, I suggest dropping it from the basecaller command (and possibly adding --no-trim), then letting dorado demux handle the barcoding and trimming.
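As a rough sketch of that suggestion, the two steps could be wrapped so that --kit-name is passed only to dorado demux, --no-trim is added to the basecaller, and demux runs only if basecalling exited cleanly. The function name `run_basecall_then_demux` and the `DORADO` variable are illustrative assumptions, not part of Dorado; the flags are the ones already used in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: basecall without barcoding or trimming, then let
# demux handle barcode classification and trimming.
# Args: MODEL POD5_DIR OUTPUT_BAM DEMUX_DIR KIT_NAME
run_basecall_then_demux() {
    local model="$1" pod5_dir="$2" bam="$3" demux_dir="$4" kit="$5"
    # DORADO is an assumed override for the binary path; defaults to "dorado".
    local dorado="${DORADO:-dorado}"

    # No --kit-name here, and --no-trim keeps the barcodes intact for demux.
    if ! "$dorado" basecaller --recursive "$model" "$pod5_dir" \
            --modified-bases 6mA --no-trim > "$bam"; then
        echo "basecaller failed; skipping demux (BAM may be truncated)" >&2
        return 1
    fi

    # Barcoding and trimming happen once, in this step.
    "$dorado" demux --output-dir "$demux_dir" --kit-name "$kit" "$bam"
}
```

With the variables from the original report, this would be called as `DORADO=$dorado run_basecall_then_demux "$model" "$input_scratch" "$output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam" "$output_scratch_demultiplex" SQK-NBD114-24`. Checking the basecaller's exit status also avoids demuxing a partial BAM, as happened in the logs above.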
Hi,
Thanks. I did use --no-trim after I posted the issue, and it worked.
However, running without the --no-trim flag worked well in version 0.6.0.
Anyway, thanks for your reply. 😊
This issue should be resolved in dorado 0.8.1, which has just been released.