ML tag can’t show the correct skipped sites number in Nanopore sequencing
PanZiwei opened this issue · 1 comments
Hi,
I took a close look at the ML tag saved in the BAM file after Guppy basecalling. I think there is something wrong with the current strategy for saving the tag, and here is my conclusion (correct me if I am wrong):
The MM tag can't show how many bases of the stated base type the sequence contains. In other words, the MM tag DOESN'T record whether there are skipped base sites after the last modified base.
I will simplify the question with the assumption that the modified base is C.
Based on my observation, the MM tag can't show whether there are skipped C sites after the last modified C.
For instance, if there is only 1 modified C and it is the 3rd C, I think the two sequences ACCTCGCCA and ACCTCGA will have the same MM tag, MM:Z:C+m?,2. In other words, MM:Z:C+m?,2 can correspond to more than one sequence. (Correct me if I am wrong.)
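A minimal sketch of this example, assuming a forward-strand read and a single modification type. The helper names `mm_deltas` and `mm_tag` are hypothetical and only mirror the skip-count encoding described above; the terminating `;` follows the form used in the SAMtags example.

```python
def mm_deltas(seq, modified_positions, base="C"):
    """Return MM skip counts for modified bases at 0-based SEQ positions."""
    modified = set(modified_positions)
    deltas = []
    skipped = 0  # bases of this type passed since the last modified one
    for pos, nt in enumerate(seq):
        if nt != base:
            continue
        if pos in modified:
            deltas.append(skipped)
            skipped = 0
        else:
            skipped += 1
    # Whatever is left in `skipped` here (trailing bases) is never written out.
    return deltas


def mm_tag(seq, modified_positions, base="C", code="m", flag="?"):
    """Format the skip counts as an MM tag string."""
    parts = [f"{base}+{code}{flag}"]
    parts += [str(d) for d in mm_deltas(seq, modified_positions, base)]
    return "MM:Z:" + ",".join(parts) + ";"


# The 3rd C sits at 0-based position 4 in both sequences.
tag_long = mm_tag("ACCTCGCCA", [4])
tag_short = mm_tag("ACCTCGA", [4])
print(tag_long)
print(tag_short)
print(tag_long == tag_short)
```

Both calls produce the same string, because the loop never emits the count of Cs that follow the last modified one.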
Another example is from https://samtools.github.io/hts-specs/SAMtags.pdf
When the ‘?’ flag is present the tag ‘C+m?,5,12,0;’ tells us the modification status of the first five cytosine bases is unknown, the sixth cytosine is called (as either modified or unmodified), followed by 12 more unknown cytosines, and the 19th and 20th are called.
However, the tag above DOESN'T tell us 1) whether there are any cytosines after the 20th C, or 2) how many cytosines follow the 20th C.
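To make the counting explicit, here is a small sketch that decodes the skip counts of one MM entry into the 1-based ordinals of the called cytosines. The function name `called_ordinals` is hypothetical, and the parsing assumes a single entry of the form shown in the quote.

```python
def called_ordinals(mm_entry):
    """Return 1-based ordinals (among bases of that type) that have a call."""
    fields = mm_entry.rstrip(";").split(",")
    deltas = [int(x) for x in fields[1:]]  # fields[0] is e.g. "C+m?"
    ordinals = []
    seen = 0
    for skip in deltas:
        seen += skip + 1  # skip `skip` bases, then land on the called one
        ordinals.append(seen)
    return ordinals


ordinals = called_ordinals("C+m?,5,12,0;")
print(ordinals)      # [6, 19, 20]
print(ordinals[-1])  # 20: the tag only implies at least this many cytosines
```

The last ordinal is a lower bound on the number of cytosines; nothing in the entry itself distinguishes a read with exactly 20 cytosines from one with more.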
If that’s the case, I think an extra number should be introduced as the last element of the MM tag to record the number of skipped C sites after the last listed C, so that these cases can be distinguished.
For example, the MM tag for ACCTCGCCA should be MM:Z:C+m?,2,2 (2 skipped Cs after the modified 3rd C), and the MM tag for ACCTCGA should be MM:Z:C+m?,2,0 (no C after the modified 3rd C). The corresponding ML tag will stay the same, with only 1 element, ML:B:C,255 (the largest value a B:C array can hold), to show the modification probability of the 3rd C.
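The proposal could be sketched as follows. This is NOT the format defined in SAMtags; `proposed_mm_tag` is a hypothetical helper that only illustrates the suggested extra trailing element, again assuming a forward-strand read and one modification type. A terminating `;` is added to match the SAMtags example.

```python
def proposed_mm_tag(seq, modified_positions, base="C", code="m", flag="?"):
    """Build the suggested (non-standard) MM tag with a trailing skip count."""
    modified = set(modified_positions)
    fields = []
    skipped = 0
    for pos, nt in enumerate(seq):
        if nt != base:
            continue
        if pos in modified:
            fields.append(skipped)
            skipped = 0
        else:
            skipped += 1
    # Suggested extension: also store the bases skipped after the last call.
    fields.append(skipped)
    parts = [f"{base}+{code}{flag}"] + [str(f) for f in fields]
    return "MM:Z:" + ",".join(parts) + ";"


print(proposed_mm_tag("ACCTCGCCA", [4]))  # MM:Z:C+m?,2,2;
print(proposed_mm_tag("ACCTCGA", [4]))    # MM:Z:C+m?,2,0;
```

With the extra element the two sequences no longer share a tag, while the ML array would still hold one value per modified base.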
And there will be a relationship: the number of skipped C sites, Nskip, is >= the sum of the items saved in the MM tag, SUM(MM_tag). So the question becomes: how can the current inconsistency between Nskip and SUM(MM_tag), caused by the MM tag issue, be resolved? Is there a way to use the ML tag to show the correct number of skipped Cs in the prediction?
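For reference, the gap between the two quantities can be computed whenever the read sequence is available next to the tag. The sketch below assumes a forward-strand read, a single-letter base code at the start of the entry, and a hypothetical helper name `skip_counts`.

```python
def skip_counts(seq, mm_entry):
    """Return (Nskip, SUM(MM_tag), difference) for one MM entry."""
    fields = mm_entry.rstrip(";").split(",")
    base = fields[0][0]                    # e.g. "C" from "C+m?"
    deltas = [int(x) for x in fields[1:]]
    total = seq.count(base)                # all bases of this type in SEQ
    n_skip = total - len(deltas)           # bases of this type not listed
    mm_sum = sum(deltas)                   # skips recorded in the tag
    return n_skip, mm_sum, n_skip - mm_sum


print(skip_counts("ACCTCGCCA", "C+m?,2;"))  # (4, 2, 2)
print(skip_counts("ACCTCGA", "C+m?,2;"))    # (2, 2, 0)
```

The third value is the number of Cs after the last listed one, which is exactly the quantity the tag alone does not carry.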
Thank you so much for your help!