nanoporetech/dorado

Different results: dorado v0.7 vs v0.8 -- modified bases, 5mc, 5hmc

szakallasn3 opened this issue ยท 5 comments

Issue Report

Dear All,
Recently we upgraded our dorado basecaller to the newest available version v0.8. As this update makes possible to call 4mc modified bases on sequencing data, I made a test run to investigate the performance of the new hac, 4mC_5mC model. It works fine, however a little bit slower than the hac, 5mC_5hmC which I used previously. Anyway, what I experienced is that the number and the fraction of modified 5mC derived different compared to those results which I gained using dorado v0.7 and model hac, 5mC_5hmC.

Encouraged by this, today I started to run the newest dorado version v0.8 with the model of hac, 5mC_5mC on the same dataset. As soon it completed I will present modified base results and compare to those I got when running hac, 5mC_5hmC model and dorado v0.7. I plan to share the two modkit data with you after I completed the process (I guess within 24 hours).

The results are the following:

  • dorado v0.8 hac,5mC_5hmC:
    bases C
    total_reads_used 10042
    count_reads_C 10042
    base code pass_count pass_frac all_count all_frac
    C - 5539808 0.97149116 6052222 0.93040127
    C m 113837 0.019963082 317491 0.04880753
    C h 48731 0.008545736 135246 0.020791214

  • dorado v0.7 has, 5mC_5hmC
    bases C
    total_reads_used 13873
    count_reads_C 13873
    base code pass_count pass_frac all_count all_frac
    C m 41398 0.31157944 60683 0.30784488
    C - 83279 0.6267941 113766 0.57713497
    C h 8188 0.061626464 22673 0.11502014

As you can see, the fraction and number of modified bases are significantly different using the distinct versions of Dorado.

Am I doing something wrong or do you performed some changes in the fundamental settings of Dorado?

I am looking forward any suggestions regarding the issue

Steps to reproduce the issue:

commands:

  • dorado basecaller hac,4mC_5mC /input/pod5/directory/ -x cuda:0 --reference /reference/genome/.fa > /output/basecalled/bam
  • samtools sort /input/basecalled/bam -o /output/sorted/bam
  • samtools index -b /input/sorted/bam
  • modkit pileup /input/sorted/bam /output/bed --combine-mods --ref /reference/genome/.fa
  • modkit summary /input/sorted/bam --filter-threshold value_determined_by_modkit_pileup > /output/modkitsum/txt

Run environment:

  • Dorado version: v0.8
  • Dorado command: the above-listed commands with dorado v0.7 and dorado v0.8 (nothing else changed)
  • Operating system: Ubuntu 24.04 LTS

Hi @szakallasn3,

Can I just confirm - you're running the same data in 0.7 and 0.8, both with 5mC_5hmC modifications? (Since your command shows the new 4mC_5mC model...) Presumably this means 0.7 downloaded the v1 model, and 0.8 downloaded the v2? The v2 model has significant improvements around mis-calls of modifications near real modifications, and so should be expected to give fewer mod calls (though probably not by that large a margin).

This is possibly related to #1059.

Finally, note: the models give the relative probabilities of the modifications within the model, not absolute values, i.e. the value is P(m|Cmh) - the probability of an m given it must be either C, m or h. So comparing the probability of a 5mC between the 4mC_5mC and the 5mC_5hmC model may give different likelihoods.

Hi @malton-ont,
Many thanks for your reply.
Yes, I can confirm, that finally I ran 0.7 5mC_5hmC and 0.8 5mC_5hmC on the same data (I mentioned 4mC_5mC also because I noticed the difference regarding 5mC values in that context).
I understand your explanation regarding v1 and v2 models, but it is still unclear for me that how could they generate such different results?
V1 showed that I had 41398 5mC calls with a relative fraction 0.311, while the v2 outputted 113837 calls and 0.02 relative probability (from the same data!). So the question risen: what is the real result? An 0.02 relative fraction of 5mC is really suspicious for me.

This is likely related to the bug in release where we have unintentionally swapped the CG-context and all-context models in the 0.8 release. Thus the large differences you are seeing are most likely due to calls in non-CG contexts causing both more 5mC calls and a much lower proportion of 5mC calls. See #1059 to track the issue. As a workaround running the all-context model (in order to get CG-context calls) should resolve this issue and produce much more consistent results. Apologies for this mixup this will be fixed in the next Dorado release coming out shortly.

Hi @marcus1487 ,
Thank you for the reply.
I will take a look on issue #1059 and looking forward for the upcoming Dorado release.

Fixed in v0.8.1.