pcingola/SnpEff

Duplication variant incorrectly classified as intron variant instead of coding?


Hi all,
Thank you for creating this amazing tool.

I am analyzing variants from an old database and remapping them to hg38. I have what seems to be the same variant in two people, recorded in two different ways: once as a duplication and once as an insertion. Here is the VCF input:

X	41473864	.	G	<DUP>	.	.	PtID=XXXXXX;SVTYPE=DUP;SVLEN=9;END=41473872	GT:DP	1:150
X	41473872	.	T	TGCGCCGCCT	.	.	PtID=YYYYYY;SVTYPE=INS;END=41473873	GT:DP	1:150

Ensembl's VEP correctly classifies the variant as protein coding.
SnpEff incorrectly classifies the <DUP> record as an intron variant, while it annotates the INS record as a disruptive inframe insertion.

I am using GRCh38.p14 build
SnpEff version 5.2a
SnpEff command is very standard
java -jar snpEff.jar -d GRCh38.p14 vcf_output_nyx2_sorted.vcf > vcf_output_nyx2_sorted_ann.vcf

For now, I am converting the short DUP entries into INS entries as a workaround, but I would like to know what is causing this issue and how it can be fixed.
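A minimal sketch of that workaround, in case it helps anyone reproduce it. The function name `dup_to_ins` is hypothetical, and the caller has to supply the reference sequence of the duplicated span (for example from a FASTA lookup, not shown). It assumes the duplicated span is POS..END inclusive, as SVLEN=9 suggests for my record, and it writes END as POS+1 only to mirror my existing INS record:

```python
def dup_to_ins(line: str, dup_seq: str) -> str:
    """Rewrite a symbolic <DUP> VCF record as an explicit tandem insertion.

    Assumption: the duplicated span is POS..END inclusive and dup_seq is the
    reference sequence of exactly that span, supplied by the caller.
    """
    fields = line.rstrip("\n").split("\t")
    pos, vid, alt, info = int(fields[1]), fields[2], fields[4], fields[7]
    if alt != "<DUP>":
        return line.rstrip("\n")  # leave non-DUP records untouched

    info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    end = int(info_map["END"])
    if len(dup_seq) != end - pos + 1:
        raise ValueError("dup_seq length does not match POS..END")

    anchor = dup_seq[-1]  # reference base at END, used as the VCF anchor base

    new_info = []
    for kv in info.split(";"):
        key = kv.split("=", 1)[0]
        if key == "SVTYPE":
            new_info.append("SVTYPE=INS")
        elif key == "END":
            new_info.append(f"END={end + 1}")  # mirrors the INS record in this issue
        elif key == "SVLEN":
            continue  # the INS record in this issue carries no SVLEN
        else:
            new_info.append(kv)

    # POS moves to END; REF is the anchor base; ALT is anchor + duplicated bases.
    fields[1:5] = [str(end), vid, anchor, anchor + dup_seq]
    fields[7] = ";".join(new_info)
    return "\t".join(fields)
```

Calling it on the DUP record above with `GCGCCGCCT` as the span sequence (inferred from the INS record, not looked up in the reference) gives a record with the same POS, REF and ALT as the INS entry.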

Thank you for your help.

More detail:

Parsed output from SnpEff:

CHROM   POS ID  REF ALT QUAL    FILTER  FORMAT  NA0001  INFO_PtID   INFO_SVTYPE INFO_SVLEN  INFO_END    Allele  Annotation  Annotation_Impact   Gene_Name   Gene_ID Feature_Type    Feature_ID  Transcript_BioType  Rank    HGVS.c  HGVS.p  cDNA.pos / cDNA.length  CDS.pos / CDS.length    AA.pos / AA.length  Distance    ERRORS / WARNINGS / INFO    INFO_LOF    INFO_NMD
X   41473864    .   G   <DUP>   .   .   GT:DP   1:150   XXXXXX  DUP 9   41473872    <DUP>   intron_variant  MODIFIER    NYX NYX transcript  NM_022567.3 protein_coding  1/1 c.                      INFO_REALIGN_3_PRIME    NA  NA
X   41473864    .   G   <DUP>   .   .   GT:DP   1:150   XXXXXX  DUP 9   41473872    <DUP>   intron_variant  MODIFIER    NYX NYX transcript  NM_001378477.3  protein_coding  2/2 c.                      INFO_REALIGN_3_PRIME    NA  NA
X   41473872    .   T   TGCGCCGCCT  .   .   GT:DP   1:150   YYYYYY  INS NA  41473873    TGCGCCGCCT  disruptive_inframe_insertion    MODERATE    NYX NYX transcript  NM_001378477.3  protein_coding  3/3 c.396_404dupGCGCCGCCT   p.Leu135_Asp136insArgArgLeu 635/2414    405/1431    135/476         NA  NA
X   41473872    .   T   TGCGCCGCCT  .   .   GT:DP   1:150   YYYYYY  INS NA  41473873    TGCGCCGCCT  disruptive_inframe_insertion    MODERATE    NYX NYX transcript  NM_022567.3 protein_coding  2/2 c.396_404dupGCGCCGCCT   p.Leu135_Asp136insArgArgLeu 967/2746    405/1431    135/476         NA  NA

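SnpEff seems to realign the DUP to a single position about 1 kb upstream of the original coordinates, which would explain the intron annotation. From the SnpEff log: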

Variant (original)   : chrX:41473864-41473871[DUP]
Variant (realinged)  : chrX:41472836-41472836[INTERVAL]

I am unsure why it does this. Both records should describe exactly the same variant.
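As a quick sanity check that the two records are at least consistent with each other, here is a small sketch using only the values from the VCF lines above. The helper name `consistent_with_tandem_dup` is hypothetical. The check is necessary but not sufficient, since it never looks at the reference sequence between the two end bases:

```python
# Values copied from the two VCF records in this issue.
dup = {"pos": 41473864, "ref": "G", "end": 41473872, "svlen": 9}
ins = {"pos": 41473872, "ref": "T", "alt": "TGCGCCGCCT"}


def consistent_with_tandem_dup(dup: dict, ins: dict) -> bool:
    """Check that an INS record could be the tandem duplication of POS..END."""
    inserted = ins["alt"][len(ins["ref"]):]  # bases added by the INS record
    return (
        dup["end"] - dup["pos"] + 1 == dup["svlen"]  # span is POS..END inclusive
        and len(inserted) == dup["svlen"]            # same number of bases added
        and ins["pos"] == dup["end"]                 # insertion right after the span
        and inserted[0] == dup["ref"]                # first base matches REF at DUP POS
        and inserted[-1] == ins["ref"]               # last base matches REF at END
    )


print(consistent_with_tandem_dup(dup, ins))
```

All five conditions hold for these two records, which agrees with SnpEff reporting the INS as `c.396_404dupGCGCCGCCT`.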