c_to_g(intronic_variant) takes genomic reference, ignoring reference base in hgvs_c
yxuil opened this issue · 1 comments
Normally this isn't a problem if the hgvs_c variant is valid. However, when a given intronic variant has an incorrect reference base, c_to_g will "correct" it to the genomic reference base, which might not be desired.
The following code demonstrates this with an intronic variant whose reference base is incorrect (it should be G instead of T):
In [1]: v2 = hp.parse("NM_001271.3:c.2190-4T>A")
...: v2_g = c_to_g(v2)
...: v2_c = g_to_c(v2_g, "NM_001271.3")
...: v2, v2_g, v2_c
Out[1]:
(SequenceVariant(ac=NM_001271.3, type=c, posedit=2190-4T>A, gene=None),
SequenceVariant(ac=NC_000015.10, type=g, posedit=92971761G>A, gene=None),
SequenceVariant(ac=NM_001271.3, type=c, posedit=2190-4G>A, gene=None))
I found this while trying to use hv.validate(c_to_g(variant)) to validate intronic variants, as a way around the error "HGVSInvalidVariantError: Cannot validate sequence of an intronic variant".
Sorry for the delay. I missed this post.
Round-tripping these operations (i.e., c→g→c) would be difficult to address.
For the first variant (v2), it's impossible to verify the T in c.2190-4T>A because transcripts do not include intronic sequence. When v2 is projected to the genome, the reference nucleotide is known and is used.
In the subsequent g_to_c operation, the v2_g variant is valid and projects normally to the transcript (with a warning that the sequence can't be validated).
So, the only way we could make this round-trip operation preserve the original transcript reference is to somehow convey that incorrect reference to the genomic variant. That is, the genomic variant would have to have some state that declared that the transcript variant from which it was derived had a bogus reference.
In my opinion, this goal is dubious and not worth the significant time that it would take to implement.
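For anyone who still wants to catch a mismatched intronic reference, a caller-side check may be enough: compare the reference base in the original c. string with the one that comes back from the c→g→c round trip. The sketch below is not part of hgvs. The helper name ref_was_changed is hypothetical, it works only on HGVS strings (assuming str() of each variant yields the usual "accession:c.posedit" form), and it handles only simple substitutions.

```python
import re

# Matches a simple substitution such as "NM_001271.3:c.2190-4T>A".
# Assumption: only single-base substitutions are of interest here;
# any other edit type is rejected rather than guessed at.
_SUBST = re.compile(
    r"^(?P<ac>[^:]+):(?P<type>[cgn])\.(?P<pos>[-+*\d]+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def ref_was_changed(original: str, round_tripped: str):
    """Hypothetical helper: return (claimed_ref, returned_ref) if the
    round trip changed the reference base, otherwise None."""
    a = _SUBST.match(original)
    b = _SUBST.match(round_tripped)
    if a is None or b is None:
        raise ValueError("only simple substitutions are supported")
    # Both strings must describe the same accession, position, and alt base.
    if (a["ac"], a["pos"], a["alt"]) != (b["ac"], b["pos"], b["alt"]):
        raise ValueError("variants do not describe the same position and edit")
    if a["ref"] != b["ref"]:
        return a["ref"], b["ref"]
    return None


# Strings corresponding to v2 and v2_c in the session above.
print(ref_was_changed("NM_001271.3:c.2190-4T>A", "NM_001271.3:c.2190-4G>A"))
# → ('T', 'G')
```

A non-None result means the reference base supplied in the transcript variant did not match the genomic reference used during projection, so the input can be flagged without the library having to carry any extra state on the genomic variant.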