Threshold for methylation detection
Closed this issue · 8 comments
Hi, sorry to bother you again
Software like Nanopolish, Tombo, and Guppy often utilize a threshold when detecting DNA methylation modifications (5mC), for example:
Tombo: use the recommended CpG motif-specific model with the default threshold of (− 1.5, 2.5) for DNA where scores below − 1.5 were considered as methylated and above 2.5 unmethylated, and scores between these thresholds did not contribute to the per-site methylation.
Nanopolish: methylated CpGs (LLR > 2.0) and unmethylated CpGs (LLR < -2.0)
Nanopolish: unmethylated (methylation rates: 0–0.15), lowmethylated (0.15–0.5), intermethylated (0.5–0.85), methylated (0.85–1).
guppy: methylated CpGs (score > 128) and unmethylated CpGs (score < 64)
Does Dorado employ a threshold when detecting RNA or DNA methylation modifications? If so, what are the ranges of these thresholds for each type?
Is this threshold automatically applied during the Dorado basecalling
process? Is it based on the probabilities stored in the BAM file's ML
for filtering?
Can all minimap2 parameters be implemented using --mm2-opts
?
can I use this parameter when basecalling
--mm2-opts "-ax splice -uf -k15"
#my script
dorado basecaller \
-x cuda:0 \
--estimate-poly-a \
--verbose \
--emit-moves \
--mm2-opts "-ax splice -uf -k15" \
--min-qscore 7 \
--reference /home/ZHH/reference/hualu_T2T/genomic.fna_revise \
--modified-bases-models /home/ZHH/software/dorado/model/rna004_130bps_sup@v5.1.0_m5C@v1,/home/ZHH/software/dorado/model/rna004_130bps_sup@v5.1.0_m6A_DRACH@v1,/home/ZHH/software/dorado/model/rna004_130bps_sup@v5.1.0_pseU@v1 \
/home/ZHH/software/dorado/model/rna004_130bps_sup@v5.1.0 \
/bio/public_dir/zhh_data/DRS_20240719_zhh/data_BC202406118-BN240605QD02S41N1/pod5/LM_0_1+2+3/pod5_pass/ \
> /bio/ZHH/sup_drs/2.1_bam/0_A.filtered.bam
When detecting methylation, is it necessary to consider secondary alignments? Should I need to remove the secondary alignments?
How can the generated polyA information be extracted? Can it be specified down to the corresponding reads or transcripts, or is it only possible to get a rough overview of the distribution?
In the SAM tags, are m5C and m6A RNA modifications represented as "m" and "a," respectively?
Is this the same for BAM files? In SAM specification file, The symbols for RNA modifications m6A and m5C are not clearly specified, so how can we distinguish between DNA modifications and RNA modifications? When performing methylation detection on RNA and DNA data separately using Dorado basecall, how should the methylation modification information stored in the MM field be differentiated?
Hi @Taylorain,
Does Dorado employ a threshold when detecting RNA or DNA methylation modifications? If so, what are the ranges of these thresholds for each type? Is this threshold automatically applied during the Dorado basecalling process?
Yes - the threshold range is [0, 1] and the default threshold is 0.05 as shown in the --help
:
Modified model arguments (detailed usage):
--modified-bases A space separated list of modified base codes. Choose from: pseU, m6A_DRACH, m6A, 6mA, m5C, 5mC, 5mCG_5hmCG, 5mCG, 5mC_5hmC, inosine_m6A, 4mC_5mC. [nargs: 1 or more]
--modified-bases-models A comma separated list of modified base model paths. [nargs=0..1] [default: ""]
> --modified-bases-threshold The minimum predicted methylation probability for a modified base to be emitted in an all-context model, [0, 1]. [default: 0.05]
Is it based on the probabilities stored in the BAM file's ML for filtering?
I'm not sure what you mean by this question but new BAM files are written with new MM/ML tags based on the mods basecalling being done and the threshold used.
Can I use this parameter when basecalling
--mm2-opts "-ax splice -uf -k15"
If it's a valid option for mm2 then it should work.
When detecting methylation, is it necessary to consider secondary alignments? Should I need to remove the secondary alignments?
This question is not a dorado issue and depends on your work - please ask on the nanopore community forum for more general support.
How can the generated polyA information be extracted? Can it be specified down to the corresponding reads or transcripts, or is it only possible to get a rough overview of the distribution?
From the README#polya-tail-estimation:
"The estimated tail length is stored in the pt:i tag of the output record. Reads for which the tail length could not be estimated will not have the pt:i tag."
In the SAM tags, are m5C and m6A RNA modifications represented as "m" and "a," respectively?
Yes
Is this the same for BAM files?
Yes
In SAM specification file, The symbols for RNA modifications m6A and m5C are not clearly specified, so how can we distinguish between DNA modifications and RNA modifications?
I believe they're ambiguous - The BAM header from dorado will contain metadata / breadcrumbs to be able to work out if an DNA or RNA model was used. However, I'd suggest good file keeping and not mix data types to make your life easier.
When performing methylation detection on RNA and DNA data separately using Dorado basecall, how should the methylation modification information stored in the MM field be differentiated?
I don't understand this question sorry.
Do I need to install TensorFlow and TensorRT before performing basecalling on a GPU? I noticed that generating four 60GB files (240GB in total)takes over a week; is this speed normal?
(base) $ nvidia-smi
Wed Oct 9 17:57:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03 Driver Version: 560.28.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A30 Off | 00000000:21:00.0 Off | 0 |
| N/A 61C P0 166W / 165W | 23285MiB / 24576MiB | 99% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2795512 C dorado 22890MiB |
| 0 N/A N/A 3876914 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 4102189 C gmx 366MiB |
+-----------------------------------------------------------------------------------------+
(base) $ lspci | grep -i nvidia
21:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Do I need to install TensorFlow and TensorRT before performing basecalling on a GPU?
No - Dorado has everything it needs - just follow the install guide.
I noticed that generating four 60GB files (240GB in total)takes over a week; is this speed normal?
You're running the most computationally intensive basecalling pipeline one could conceive of.
By using the largest simplex model (sup v5+) and 3 additional mods models this is going to be quite slow unfortunately. We are always working on improving performance but this is a heavy workflow with not a huge amount of GPU resource to power it.
I get! Thanks pretty much for your great help!
Happy to help @Taylorain - Are you happy to close this issue?