arq5x/gemini

support for additional VEP terms

jxchong opened this issue · 6 comments

Based on the findings of the DDD paper, we would like to be able to filter for the following variant annotations created by the VEP SpliceRegion plugin:

splice_donor_5th_base_variant
splice_donor_region_variant
splice_polypyrimidine_tract_variant
extended_intronic_splice_region_variant_5prime
extended_intronic_splice_region_variant_3prime

Info here: http://www.ensembl.info/2018/10/26/cool-stuff-the-vep-can-do-splice-site-variant-annotation/
Plugin here: https://github.com/Ensembl/VEP_plugins/blob/release/94/SpliceRegion.pm

None of these annotations are currently listed in GEMINI's impacts column. How would we access them, given that they don't have their own custom vep_xxx column? (My understanding is that VEP simply reports them as the annotation.) Right now we just use impact_severity<>'LOW' in GEMINI, so I imagine we would have to use impact_severity<>'LOW' or xxxxx='yyy' or ...
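The kind of combined filter described above can be sketched with a toy SQLite table. This is only an illustration: the table is a stand-in rather than GEMINI's real schema, and the `vep_spliceregion` column name is hypothetical (it assumes the plugin output ends up in a column of its own).

```python
import sqlite3

# Toy stand-in for a variants table; NOT GEMINI's actual schema.
# The vep_spliceregion column name is a hypothetical placeholder.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants ("
    "variant_id INTEGER, impact TEXT, impact_severity TEXT, vep_spliceregion TEXT)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?)",
    [
        (1, "missense_variant", "MED", None),
        (2, "intron_variant", "LOW", "splice_polypyrimidine_tract_variant"),
        (3, "intron_variant", "LOW", None),
    ],
)

# The five SpliceRegion terms listed in this issue.
SPLICE_TERMS = [
    "splice_donor_5th_base_variant",
    "splice_donor_region_variant",
    "splice_polypyrimidine_tract_variant",
    "extended_intronic_splice_region_variant_5prime",
    "extended_intronic_splice_region_variant_3prime",
]

# Keep anything that is not LOW severity, plus LOW variants carrying a splice term.
# Note: "_" is a single-character wildcard in LIKE, so matching is slightly loose.
term_clauses = " OR ".join("vep_spliceregion LIKE ?" for _ in SPLICE_TERMS)
query = (
    "SELECT variant_id FROM variants "
    f"WHERE impact_severity <> 'LOW' OR {term_clauses} "
    "ORDER BY variant_id"
)
params = [f"%{term}%" for term in SPLICE_TERMS]
selected = [row[0] for row in conn.execute(query, params)]
```

Here variant 3 is excluded because it is LOW severity and has no splice-region term, while variant 2 is kept despite its LOW severity.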

arq5x commented

I honestly think this is the realm of the new gemini workflow based upon vcfanno and vcf2db. Our goal is to switch over to this entirely this year.

Thanks Aaron. If we switch to vcfanno/vcf2db right now, would these be accessible to us in queries?

arq5x commented

If they are in the VCF via vcfanno or VEP, they make it into the database. @brentp - can you corroborate?

I think these would be impacts in the CSQ string, right? e.g. instead of splice_variant it would now be splice_donor_5th_base_variant so we'd have to update the geneimpacts module.

An example VCF with a few variants would be helpful.

Ok, we finally got this working in VEP and these show up in the CSQ string, but not in the Consequence field. They are instead in the SpliceRegionOutput field.

Here's an example. More examples in the VCF available here:
https://www.dropbox.com/s/mg7u3nkxil7p4h5/spliceregionexamples.vcf.gz?dl=0

1    38272660    rs2291297    G    A    42583.1    PASS    AC=1;AF=0.224;AN=2;BaseQRankSum=-1.622;ClippingRankSum=0.271;DB;DP=3988;ExcessHet=0.4621;FS=0.528;InbreedingCoeff=0.1309;MLEAC=43;MLEAF=0.224;MQ=9.49;MQ0=0;MQRankSum=0;QD=19.89;ReadPosRankSum=0.463;SOR=0.637;CSQ=A|downstream_gene_variant|MODIFIER|MTF1|ENSG00000188786|Transcript|ENST00000373036|protein_coding||||||||||rs2291297|2579|-1||HGNC|7428|YES|CCDS30676.1|1|C1orf122||||||||||||,A|upstream_gene_variant|MODIFIER|C1orf122|ENSG00000197982|Transcript|ENST00000373042|protein_coding||||||||||rs2291297|1158|1||HGNC|24789|YES|CCDS427.2||C1orf122||||||||||||,A|5_prime_UTR_variant|MODIFIER|C1orf122|ENSG00000197982|Transcript|ENST00000373043|protein_coding|1/2||ENST00000373043.1:c.-1697G>A||10/2229|||||rs2291297||1||HGNC|24789||CCDS44112.1||C1orf122||||||||||||,A|intron_variant|MODIFIER|YRDC|ENSG00000196449|Transcript|ENST00000373044|protein_coding||2/4|ENST00000373044.2:c.505-12C>T|||||||rs2291297||-1||HGNC|28905|YES|CCDS30675.1||C1orf122||||||||||||splice_polypyrimidine_tract_variant,A|upstream_gene_variant|MODIFIER|C1orf122|ENSG00000197982|Transcript|ENST00000419397|processed_transcript||||||||||rs2291297|672|1||HGNC|24789||||C1orf122||||||||||||,A|upstream_gene_variant|MODIFIER|C1orf122|ENSG00000197982|Transcript|ENST00000446260|protein_coding||||||||||rs2291297|1422|1||HGNC|24789||||C1orf122||||||||||||,A|upstream_gene_variant|MODIFIER|C1orf122|ENSG00000197982|Transcript|ENST00000468084|protein_coding||||||||||rs2291297|759|1||HGNC|24789||CCDS44112.1||C1orf122||||||||||||,A|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000004891|promoter||||||||||rs2291297|||||||||C1orf122||||||||||||    GT:AD:DP:GQ:PL    0/1:37,27:.:99:771,0,945

arq5x commented

Gotcha, looks like we would need to update the logic in geneimpacts and in vcf2db to support this.
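
For illustration, the kind of logic change discussed here might look roughly like the sketch below. It is not the actual geneimpacts or vcf2db code. It assumes the field order is read from the `Format:` part of the CSQ header, uses a shortened, made-up header with only six fields, and assumes multiple terms within a field are joined with `&`. The helper names (`csq_field_names`, `parse_csq`, `all_terms`) are hypothetical.

```python
from typing import Dict, List

# Assumption: a shortened, illustrative CSQ header description.
# A real VEP header lists many more fields than these six.
CSQ_DESCRIPTION = (
    "Consequence annotations from Ensembl VEP. "
    "Format: Allele|Consequence|IMPACT|SYMBOL|Feature|SpliceRegionOutput"
)

# Two entries abridged from the example record in this thread.
EXAMPLE_CSQ = (
    "A|upstream_gene_variant|MODIFIER|C1orf122|ENST00000373042|,"
    "A|intron_variant|MODIFIER|YRDC|ENST00000373044|"
    "splice_polypyrimidine_tract_variant"
)


def csq_field_names(description: str) -> List[str]:
    """Extract the ordered field names from the CSQ header description."""
    return description.split("Format:")[1].strip().strip('"').split("|")


def parse_csq(csq_value: str, fields: List[str]) -> List[Dict[str, str]]:
    """Split a CSQ INFO value into one dict per annotated feature."""
    return [dict(zip(fields, entry.split("|"))) for entry in csq_value.split(",")]


def all_terms(annotation: Dict[str, str],
              extra_field: str = "SpliceRegionOutput") -> List[str]:
    """Combine the Consequence terms with any terms from the plugin field."""
    terms = [t for t in annotation.get("Consequence", "").split("&") if t]
    terms += [t for t in annotation.get(extra_field, "").split("&") if t]
    return terms


fields = csq_field_names(CSQ_DESCRIPTION)
annotations = parse_csq(EXAMPLE_CSQ, fields)
for ann in annotations:
    print(ann["Feature"], all_terms(ann))
```

With the terms merged this way, a severity lookup could treat the SpliceRegion terms like any other consequence term, although how they should be ranked is a separate decision.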