FelixKrueger/Bismark

Question about Bismark's Handling of Mismatches and Deletions

Closed this issue · 1 comments

Hello,

First of all, thank you for developing and maintaining such a great tool! I’ve been using Bismark to analyze CpG, CHG, and CHH sites, and I noticed something curious: when there are mismatches, Bismark seems to rely on the reference genome, but when there are deletions, it appears to rely on the reads. I understand that this may be due to the alignment algorithm, but I would like to understand the reasoning behind this behavior in more detail.

Could you explain why Bismark prioritizes the read sequence in the case of deletions, instead of following the same logic as for mismatches, where the reference genome is used? Could this behavior potentially affect certain analyses, particularly when handling specific regions with frequent mismatches or indels?

I appreciate your time and any clarification you can provide!

Hi @hmu-youbai,

I have written down some thoughts on this behaviour and its potential downstream consequences here: https://felixkrueger.github.io/Bismark/faq/context_change/

Maybe as short attempt of our thinking of a rationale for this behaviour (which is now well than a decade ago...):
Generally, there is a conscious decision to base methylation calls on the reference genome, because this in common for all samples analysed, and sequencing errors and/or mutations do not produce wildly different coordinates.
With the introduction of Indel capable aligners (Bowtie2, HISAT2 or minimap2), we had to think about insertions and deletions as well. Insertions can be part of sequencing errors (even though the percentage of these is fairly low in Illumina data) or e.g. homo-polymer stretches, and it isn't immediately clear how to handle these. Insertions could either lead to context changes (with potentially low confidence), or even to the introduction of new bases that could essentially lead to a shift of the genome coordinate system (e.g. for and inserstion ACG). We therefore chose to mark insertions with XXX for the context calling, meaning that Cs right adjacent to an insertion will receive an Unknown context and thus not take part in downstream analysis at all. Equally, inserted Cs are ignored as well.

For deletions on the other hand, the position of the cytosine as such is the same as in the genome, and if anything then the context tends to change from CG to CHG or CHH, or vice versa. In mammalian systems this typically results in a status change from methylated to unmethylated (or the other way round), and this can amount to a substantial fraction of calls right next to a deletion. So in essence, deletions 'only' result in a potential context change, but they do not results in a shift of the coordinate system, and they are not reliant on introduced basecall qualities.

It is not a perfect solution, but of all the options it seemed to be the best compromise. Does this answer your question?