nanoporetech/dorado

Recommended chunk and batch sizes for R9 basecalling

Closed this issue · 5 comments

Hello again,

I'm basecalling some old data that was run with LSK109 and R9 flowcells a few years ago. I assume dorado can handle this since there is a model for it. However, dorado complains that it does not know how to set the batch size for this combination. Any advice on which parameters to set?

Command:

dorado basecaller --no-trim dna_r9.4.1_e8_sup@v3.6 PAI75040.pod5 > PAI75040.basecalls.bam

Log:

[2024-10-03 20:36:13.394] [info] Running: "basecaller" "--no-trim" "dna_r9.4.1_e8_sup@v3.6" "PAI75040.pod5"
[2024-10-03 20:36:13.933] [info] > Creating basecall pipeline
[2024-10-03 20:36:14.806] [warning] Unable to find chunk benchmarks for GPU "NVIDIA A100-SXM4-40GB", model dna_r9.4.1_e8_sup@v3.6 and chunk size 1440. Full benchmarking will run for this device, which may take some time.
[2024-10-03 20:36:24.254] [info] cuda:0 using chunk size 10000, batch size 1152
[2024-10-03 20:36:24.736] [info] cuda:0 using chunk size 5000, batch size 2048
[...]

Hi @diego-rt ,

Dorado is just automatically detecting the optimal batch size for your device. As the log says, "Full benchmarking will run for this device", so if you let it run it should find the optimal batch size. Is it failing in some way?

Thanks for the quick reply! It wasn't failing, but since the run is parallelized across many jobs I would rather not have it repeat the optimisation on every node. It was also unclear to me how long this process takes and when exactly the actual basecalling starts.

@diego-rt,

Looking at your log above, benchmarking took about 10 seconds in this case (the time difference between the warning and the announcement of the selected batch size). If you wish to override this value and skip the benchmarking, you can add the -b parameter with your chosen batch size.
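For anyone who wants to check this on their own logs, here is a minimal Python sketch that measures the gap between the benchmarking warning and the first "using chunk size" line, and lists the batch sizes dorado reported. The log lines are copied from the first run above; the function name `parse_benchmark_log` and the regular expressions are my own assumptions about the log layout, not part of dorado.

```python
import re
from datetime import datetime

# Log lines copied verbatim from the first run in this thread.
LOG = """\
[2024-10-03 20:36:14.806] [warning] Unable to find chunk benchmarks for GPU "NVIDIA A100-SXM4-40GB", model dna_r9.4.1_e8_sup@v3.6 and chunk size 1440. Full benchmarking will run for this device, which may take some time.
[2024-10-03 20:36:24.254] [info] cuda:0 using chunk size 10000, batch size 1152
[2024-10-03 20:36:24.736] [info] cuda:0 using chunk size 5000, batch size 2048
"""

# Assumed line layout: "[timestamp] [level] message".
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] \[(?P<level>\w+)\] (?P<msg>.*)$"
)
BATCH_RE = re.compile(r"using chunk size (\d+), batch size (\d+)")


def parse_benchmark_log(log):
    """Return (seconds from warning to first selection, [(chunk, batch), ...])."""
    warned_at = None
    selected_at = None
    choices = []
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that are not dorado log records
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
        if "Full benchmarking will run" in m["msg"]:
            warned_at = ts
        hit = BATCH_RE.search(m["msg"])
        if hit:
            if selected_at is None:
                selected_at = ts  # first announcement ends the measured interval
            choices.append((int(hit[1]), int(hit[2])))
    elapsed = None
    if warned_at is not None and selected_at is not None:
        elapsed = (selected_at - warned_at).total_seconds()
    return elapsed, choices


elapsed, choices = parse_benchmark_log(LOG)
print(f"benchmarking took ~{elapsed:.1f} s")
for chunk, batch in choices:
    print(f"chunk size {chunk}: batch size {batch}")
```

One of the reported batch sizes could then be passed with -b on the other nodes. The thread does not say which of the two logged values -b expects for this model, so check that against dorado's help output before relying on it.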

Oh I see! Not much of an issue then and thanks a lot for the info.

And sorry to bother you yet again, but I now have an issue with this same dataset: running demultiplexing leads to some kind of failure. The dataset was generated with R9, LSK109, and EXP-NBD114 for barcoding.

This is the command:

dorado basecaller dna_r9.4.1_e8_sup@v3.6 PAI75040.pod5 --kit-name EXP-NBD114 > PAI75040.basecalls.bam

And this is the log:

INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
[2024-10-07 14:28:26.067] [info] Running: "basecaller" "dna_r9.4.1_e8_sup@v3.6" "PAI75040.pod5" "--kit-name" "EXP-NBD114"
[2024-10-07 14:28:26.345] [info] > Creating basecall pipeline
[2024-10-07 14:28:27.241] [warning] Unable to find chunk benchmarks for GPU "NVIDIA A100-PCIE-40GB", model dna_r9.4.1_e8_sup@v3.6 and chunk size 1440. Full benchmarking will run for this device, which may take some time.
[2024-10-07 14:28:37.510] [info] cuda:0 using chunk size 10000, batch size 1152
[2024-10-07 14:28:38.042] [info] cuda:0 using chunk size 5000, batch size 2240
terminate called after throwing an instance of 'std::invalid_argument'
  what():  Trim interval 65-58 is invalid for sequence TGCTTCGTTCGTTTACGTATTGCCTAAGGTTAAAGAACGACTTCCATAGTCGTGTGACAGCACCTACGTAACTGAGC
/scratch-cbe/users/diego.terrones/3_TExpression/1_limbRegeneration/1_process10xFivePrime/2_processNanopore/1_basecalling/2b/31b46dba16c697d91319ed4fc3dc74/.command.sh: line 3: 11186 Aborted                 dorado basecaller dna_r9.4.1_e8_sup@v3.6 PAI75040.pod5 --kit-name EXP-NBD114 > PAI75040.basecalls.bam

I also tried running the demux after basecalling (with --no-trim) but this resulted in everything being called as unclassified.

@diego-rt, this is fixed in dorado 0.8.1. See #1020.

You don't state which arguments you used for demux, but if you classify during basecalling then you need to use --no-classify instead of --kit-name for the second step. If you have further problems with this, please raise a new issue.
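To make the two cases concrete, here is a small Python sketch that builds the `dorado demux` command line for each one. The helper name `demux_command` and the output directory `demux_out` are hypothetical. I'm assuming the `--output-dir` option as it appears in the dorado README, so double-check it against the help output of your version.

```python
import shlex


def demux_command(bam, out_dir, kit_name=None):
    """Build an argv list for `dorado demux` (hypothetical helper).

    kit_name=None: the BAM was already classified during basecalling
    (basecaller was run with --kit-name), so demux only needs to split
    reads by their existing classification and gets --no-classify.
    kit_name set: the BAM is unclassified (for example basecalled with
    --no-trim and no kit), so demux does the barcode classification itself.
    """
    cmd = ["dorado", "demux", "--output-dir", out_dir]
    if kit_name is None:
        cmd.append("--no-classify")
    else:
        cmd += ["--kit-name", kit_name]
    cmd.append(bam)
    return cmd


# Case 1: barcodes were classified during basecalling.
print(shlex.join(demux_command("PAI75040.basecalls.bam", "demux_out")))

# Case 2: basecalling was run without a kit; classify at demux time.
print(shlex.join(demux_command("PAI75040.basecalls.bam", "demux_out", "EXP-NBD114")))
```

The list can be handed to `subprocess.run(cmd, check=True)` once the printed commands look right.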