Different results: dorado v0.7 vs v0.8 -- modified bases, 5mc, 5hmc

Question

Different results: dorado v0.7 vs v0.8 -- modified bases, 5mc, 5hmc

szakallasn3 opened this issue 3 months ago · 5 comments

Issue Report

Dear All,
Recently we upgraded our dorado basecaller to the newest available version v0.8. As this update makes possible to call 4mc modified bases on sequencing data, I made a test run to investigate the performance of the new hac, 4mC_5mC model. It works fine, however a little bit slower than the hac, 5mC_5hmC which I used previously. Anyway, what I experienced is that the number and the fraction of modified 5mC derived different compared to those results which I gained using dorado v0.7 and model hac, 5mC_5hmC.

Encouraged by this, today I started to run the newest dorado version v0.8 with the model of hac, 5mC_5mC on the same dataset. As soon it completed I will present modified base results and compare to those I got when running hac, 5mC_5hmC model and dorado v0.7. I plan to share the two modkit data with you after I completed the process (I guess within 24 hours).

The results are the following:

dorado v0.8 hac,5mC_5hmC:
bases C
total_reads_used 10042
count_reads_C 10042
base code pass_count pass_frac all_count all_frac
C - 5539808 0.97149116 6052222 0.93040127
C m 113837 0.019963082 317491 0.04880753
C h 48731 0.008545736 135246 0.020791214
dorado v0.7 has, 5mC_5hmC
bases C
total_reads_used 13873
count_reads_C 13873
base code pass_count pass_frac all_count all_frac
C m 41398 0.31157944 60683 0.30784488
C - 83279 0.6267941 113766 0.57713497
C h 8188 0.061626464 22673 0.11502014

As you can see, the fraction and number of modified bases are significantly different using the distinct versions of Dorado.

Am I doing something wrong or do you performed some changes in the fundamental settings of Dorado?

I am looking forward any suggestions regarding the issue

Steps to reproduce the issue:

commands:

dorado basecaller hac,4mC_5mC /input/pod5/directory/ -x cuda:0 --reference /reference/genome/.fa > /output/basecalled/bam
samtools sort /input/basecalled/bam -o /output/sorted/bam
samtools index -b /input/sorted/bam
modkit pileup /input/sorted/bam /output/bed --combine-mods --ref /reference/genome/.fa
modkit summary /input/sorted/bam --filter-threshold value_determined_by_modkit_pileup > /output/modkitsum/txt

Run environment:

Dorado version: v0.8
Dorado command: the above-listed commands with dorado v0.7 and dorado v0.8 (nothing else changed)
Operating system: Ubuntu 24.04 LTS

Answer 1 · 2024-10-02T12:35:11.000Z

Hi @szakallasn3,

Can I just confirm - you're running the same data in 0.7 and 0.8, both with 5mC_5hmC modifications? (Since your command shows the new 4mC_5mC model...) Presumably this means 0.7 downloaded the v1 model, and 0.8 downloaded the v2? The v2 model has significant improvements around mis-calls of modifications near real modifications, and so should be expected to give fewer mod calls (though probably not by that large a margin).

This is possibly related to #1059.

Finally, note: the models give the relative probabilities of the modifications within the model, not absolute values, i.e. the value is P(m|Cmh) - the probability of an m given it must be either C, m or h. So comparing the probability of a 5mC between the 4mC_5mC and the 5mC_5hmC model may give different likelihoods.

Answer 2 · 2024-10-02T13:17:48.000Z

Hi @malton-ont,
Many thanks for your reply.
Yes, I can confirm, that finally I ran 0.7 5mC_5hmC and 0.8 5mC_5hmC on the same data (I mentioned 4mC_5mC also because I noticed the difference regarding 5mC values in that context).
I understand your explanation regarding v1 and v2 models, but it is still unclear for me that how could they generate such different results?
V1 showed that I had 41398 5mC calls with a relative fraction 0.311, while the v2 outputted 113837 calls and 0.02 relative probability (from the same data!). So the question risen: what is the real result? An 0.02 relative fraction of 5mC is really suspicious for me.

Answer 3 · 2024-10-02T14:08:37.000Z

This is likely related to the bug in release where we have unintentionally swapped the CG-context and all-context models in the 0.8 release. Thus the large differences you are seeing are most likely due to calls in non-CG contexts causing both more 5mC calls and a much lower proportion of 5mC calls. See #1059 to track the issue. As a workaround running the all-context model (in order to get CG-context calls) should resolve this issue and produce much more consistent results. Apologies for this mixup this will be fixed in the next Dorado release coming out shortly.

Answer 4 · 2024-10-04T07:49:54.000Z

Hi @marcus1487 ,
Thank you for the reply.
I will take a look on issue #1059 and looking forward for the upcoming Dorado release.

Answer 5 · 2024-10-04T10:56:25.000Z

Fixed in v0.8.1.