nanoporetech/dorado

4 types of errors after upgrading from v0.7.2 to v0.8.1

Closed this issue · 3 comments

Issue Report

Hi! I've just upgraded from v0.7.2 to v0.8.1, ran it on 100 samples, and found 7 failed runs showing 4 distinct types of errors.
For some of the failed samples I also have successful v0.7.2 runs; logs for both are provided below.
I'd gratefully accept any help with fixing this.

Run environment:

  • Operating system: ubuntu:22.04 inside the dorado Singularity container, running on a SLURM cluster
  • Cuda: Driver Version: 525.60.11 CUDA Version: 12.0
  • Hardware (CPUs, Memory, GPUs):
    • 12 CPUs (can provide up to 48)
    • 170 GB RAM (can provide up to 450 GB)
    • 2 x Tesla V100S-PCIE-32GB
  • Source data type: pod5
  • Source data location: NFS
  • Details about data:
    • kit-flowcell: dna_r10.4.1_e8.2_400bps 5 kHz
    • number of reads: about 4,319,592 per sample
    • read lengths: about 5,000
    • total dataset size: about 500-1000 GB per sample
  • Dataset to reproduce, if applicable: for now we cannot provide it
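For context, here is a minimal sketch of how a run like this is submitted; the script name, image file, and bind path are illustrative placeholders rather than our exact setup, and the dorado arguments are forwarded to the script (they are quoted in full under "Current runs" below):

```bash
#!/usr/bin/env bash
#SBATCH --job-name=dorado_basecall   # placeholder job name
#SBATCH --cpus-per-task=12           # 12 CPUs, as listed above
#SBATCH --mem=170G                   # 170 GB RAM, as listed above
#SBATCH --gres=gpu:2                 # 2 x Tesla V100S-PCIE-32GB

# dorado.sif and the bind path are placeholders for illustration.
# All dorado arguments are forwarded from the sbatch command line, e.g.
#   sbatch run_dorado.sh basecaller --verbose --recursive ...
singularity exec --nv \
    --bind /mnt/Storage:/mnt/Storage \
    dorado.sif \
    dorado "$@"
```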

Current runs:

  • Dorado version: v0.8.1
  • Model: hac@v5.0,5mCG_5hmCG@v2
  • Dorado command:
"basecaller" "--verbose" "--recursive" "--models-directory" "/opt/dorado/models" "--min-qscore" "15" "--mm2-opts" "-Y --secondary yes" "--device" "cuda:all" "hac@v5.0,5mCG_5hmCG@v2" "/mnt/Storage/raw_ont/ID_MASKED/ID_MASKED/ID_MASKED" "--reference" "/mnt/Storage/databases/reference/GRCh38.d1.vd1.fa"

Past runs:

  • Dorado version: v0.7.2
  • Model: hac@v5.0,5mCG_5hmCG@v1
  • Dorado command:
"basecaller" "--verbose" "--recursive" "--min-qscore" "15" "-Y" "--secondary" "yes" "--device" "cuda:all" "--batchsize" "0" "hac@v5.0,5mCG_5hmCG@v1" "/mnt/Storage/raw_ont/ID_MASKED/ID_MASKED/ID_MASKED" "--reference" "/mnt/Storage/databases/reference/GRCh38.d1.vd1.fa"

Logs

Sample1

v0.8.1 failed

logs: sample1.dorado_hac@v5.0_CG@v2.minimap2.txt
error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid character in sequence.
/usr/bin/bash: line 1: 963860 Aborted                 (core dumped) dorado basecaller --verbose --recursive --models-directory /opt/dorado/models --min-qscore 15 --mm2-opts "-Y --secondary yes" --device cuda:all hac@v5.0,5mCG_5hmCG@v2 /mnt/Storage/raw_ont/ID_MASKED/ID_MASKED/ID_MASKED --reference /mnt/Storage/databases/reference/GRCh38.d1.vd1.fa > /mnt/Storage/testdata/Results/ID_MASKED/ID_MASKED/Temp/ID_MASKED.ONT.ID_MASKED.dorado_hac@v5.0_CG@v2.minimap2.bam

Sample2

v0.8.1 failed

logs: sample2.dorado_hac@v5.0_CG@v2.minimap2.txt
error:

[2024-10-06 14:47:48.223] [debug] Load reads from file /mnt/Storage/raw_ont/ID_MASKED/ID_MASKED/ID_MASKED/pod5/ID_MASKED_1.pod5
/usr/bin/bash: line 1: 153868 Bus error               (core dumped) dorado basecaller --verbose --recursive --models-directory /opt/dorado/models --min-qscore 15 --mm2-opts "-Y --secondary yes" --device cuda:all hac@v5.0,5mCG_5hmCG@v2 /mnt/Storage/raw_ont/ID_MASKED/ID_MASKED/ID_MASKED --reference /mnt/Storage/databases/reference/GRCh38.d1.vd1.fa > /mnt/Storage/testdata/Results/ID_MASKED/ID_MASKED/Temp/ID_MASKED.ONT.ID_MASKED.dorado_hac@v5.0_CG@v2.minimap2.bam

Sample3

v0.8.1 failed

logs: sample3.dorado_hac@v5.0_CG@v2.minimap2.txt
error:

[2024-10-06 17:43:53.899] [debug] cuda:1 Decode memory 4.96GB
/usr/bin/bash: line 1: 894539 Segmentation fault      (core dumped) dorado basecaller --verbose --recursive --models-directory /opt/dorado/models --min-qscore 15 --mm2-opts "-Y --secondary yes" --device cuda:all hac@v5.0,5mCG_5hmCG@v2 /mnt/Storage/raw_ont/ID_MASKED/ID_MASKED/ID_MASKED --reference /mnt/Storage/databases/reference/GRCh38.d1.vd1.fa > /mnt/Storage/testdata/Results/ID_MASKED/ID_MASKED/Temp/ID_MASKED.ONT.ID_MASKED.dorado_hac@v5.0_CG@v2.minimap2.bam

v0.7.2 succeeded

logs: sample3.dorado_hac@v5.0_CG@v1.minimap2.txt

Sample4

v0.8.1 failed

logs: sample4.dorado_hac@v5.0_CG@v2.minimap2.txt
error:

[W CublasHandlePool.cpp:56] Warning: Could not parse CUBLAS_WORKSPACE_CONFIG, using default workspace size of 8519680 bytes. (function parseChosenWorkspaceSize)
[2024-10-06 18:24:37.685] [info] cuda:1 using chunk size 4998, batch size 5120
[2024-10-06 18:24:37.685] [debug] cuda:1 Model memory 12.01GB
[2024-10-06 18:24:37.685] [debug] cuda:1 Decode memory 4.96GB
[2024-10-06 18:24:39.396] [error] finalise() not called on a HtsFile.
[2024-10-06 18:24:39.399] [error] Expected CUBLAS_WORKSPACE_SPACE_CONFIG match of size 3 (Format :SIZE:COUNT)
Exception raised from parseChosenWorkspaceSize at /pytorch/pyold/aten/src/ATen/cuda/CublasHandlePool.cpp:61 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fdf862f79b7 in /opt/dorado/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7fdf7f87c1de in /opt/dorado/bin/../lib/libdorado_torch_lib.so)
frame #2: <unknown function> + 0xa90902a (0x7fdf861c802a in /opt/dorado/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0xa90972e (0x7fdf861c872e in /opt/dorado/bin/../lib/libdorado_torch_lib.so)
frame #4: at::cuda::getCurrentCUDABlasHandle() + 0x2a2 (0x7fdf861c89f2 in /opt/dorado/bin/../lib/libdorado_torch_lib.so)
frame #5: dorado() [0xae08f0]
frame #6: dorado() [0xae037f]
frame #7: dorado() [0xa4edbf]
frame #8: dorado() [0xa54890]
frame #9: dorado() [0xa54ca1]
frame #10: dorado() [0xa3fd82]
frame #11: dorado() [0xa7c5e9]
frame #12: dorado() [0xa7c718]
frame #13: dorado() [0xa7b583]
frame #14: dorado() [0x99ec19]
frame #15: dorado() [0x898f27]
frame #16: dorado() [0x84825b]
frame #17: <unknown function> + 0x99ee8 (0x7fdf7a644ee8 in /lib/x86_64-linux-gnu/libc.so.6)
frame #18: dorado() [0x89932f]
frame #19: dorado() [0x8553b0]
frame #20: <unknown function> + 0x1196e380 (0x7fdf8d22d380 in /opt/dorado/bin/../lib/libdorado_torch_lib.so)
frame #21: <unknown function> + 0x94ac3 (0x7fdf7a63fac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x126850 (0x7fdf7a6d1850 in /lib/x86_64-linux-gnu/libc.so.6)

v0.7.2 succeeded

logs: sample4.dorado_hac@v5.0_CG@v1.minimap2.txt

Hi @pustoshilov-d, we're looking into these issues. Thanks for bringing them to our attention.

@pustoshilov-d, can you reproduce these issues when only one GPU is selected using --device cuda:0?
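That is, keeping all other arguments from your report the same and changing only the device flag, for example:

```bash
# Identical to the failing command except for --device cuda:0
dorado basecaller --verbose --recursive \
    --models-directory /opt/dorado/models --min-qscore 15 \
    --mm2-opts "-Y --secondary yes" --device cuda:0 \
    hac@v5.0,5mCG_5hmCG@v2 \
    /mnt/Storage/raw_ont/ID_MASKED/ID_MASKED/ID_MASKED \
    --reference /mnt/Storage/databases/reference/GRCh38.d1.vd1.fa
```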

Closing this ticket since the underlying cause of these issues should be fixed in dorado-0.8.2. Please reopen if you're able to reproduce these issues in the future.