MNP representation in vcf output
migbro opened this issue · 3 comments
I am working on creating consensus calls, and lancet is one of our callers. While working on MNPs, I noticed that lancet's MNP calls all include the leading reference base, meaning a phased call of CA -> GC would be output as TCA -> TGC. This causes some issues in consensus calling, since Mutect2 and VarDict would not include that leading reference base. It also threw off our benchmarking numbers before I caught it: without an adjustment, MNPs were counted as misses against the truth set when they were in fact hits. Lastly, it also causes a problem with annotation. For instance, an unadjusted call like
11 | 124440144 | GGG | GAA | MODERATE | OR8A1 |
Would have no SIFT prediction, while after adjustment:
11 | 124440145 | GG | AA | MODERATE | OR8A1 |
Would have a SIFT annotation of deleterious(0)
I can easily make this adjustment myself, but I am curious as to why lancet represents MNPs this way. I am a fan of the caller by the way, and hope this question helps make it better!
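For anyone hitting the same mismatch, the adjustment described above can be sketched in a few lines of Python. This is only a minimal sketch, not lancet's code or the exact pipeline used here: the function name `trim_mnp` is hypothetical, it assumes 1-based VCF positions and a single ALT allele, and it deliberately leaves indels untouched because they need their anchor base.

```python
def trim_mnp(pos, ref, alt):
    """Trim bases shared by REF and ALT from a same-length substitution.

    pos is 1-based, as in VCF. Records where REF and ALT differ in length
    (indels) are returned unchanged, since they need the anchor base.
    """
    if len(ref) != len(alt) or len(ref) < 2:
        return pos, ref, alt
    # Drop shared trailing bases first; this does not move POS.
    while len(ref) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Drop shared leading bases, shifting POS right by one each time.
    while len(ref) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# The call from this issue: the leading G is shared, so POS moves by one.
print(trim_mnp(124440144, "GGG", "GAA"))  # (124440145, 'GG', 'AA')
```

Applying the same trimming to every caller's output before comparison gives MNPs a common (position, REF, ALT) key, regardless of whether a caller emits the leading base.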
Thank you for your feedback. Glad to hear that you are a fan of the tool!
There is no specific reason why we adopted this format, except that we were treating these multi-base mutations the same way as indels, for which the VCF format does require adding the preceding base. I agree with you that it would be more convenient to remove the leading base for MNPs to simplify comparisons to other callers. We will add this feature request to the next release of the software.
I am a little bit confused about the annotation problem that you describe. Since the two representations give rise to the same final sequence, they should be identical for downstream analysis. Annotation software should be smart enough to use the actual sequence rather than just the genomic position, although the SIFT prediction software does not seem to do that in your case.
Hi @gnarzisi,
This is true, smarter software should be able to account for that. The situation described above occurred while using Ensembl's Variant Effect Predictor (VEP). It is surprising that this would occur given how long that software has been around. The good news is that, in my consensus calling process, the indel normalization step ends up treating those MNPs like indels, and when they are left-aligned, that leading base is removed. I suppose we could then consider this more of a bug in VEP, and perhaps, for individual users, a warning not to rely blindly on the chromosome positions. Your logic makes sense, and I suspected as much, given that MNPs are pretty much same-length insertions, so the same mechanism is probably employed. Maybe just a note in the docs/README would be good enough? Thanks for your response!
Update: I missed your comment that you were planning to implement my suggestion in the next release. Sounds good! Feel free to close this issue now, or close it when the update is made. Thanks again!
Glad to hear that it was an easy fix for you. Closing the ticket but feel free to re-open if needed.