andrewhill157/leiden

Look into changing to VEP Remapping

Closed this issue · 2 comments

Apparently VEP can take HGVS-formatted variants as input when the --refseq flag is turned on (assuming you have the human RefSeq cache downloaded). By specifying VCF as the output format, you can effectively remap HGVS variants to genomic coordinates in VCF.
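A minimal sketch of what that invocation could look like when driven from Python. The executable name `vep` and the helper names are assumptions (older installs ship the script as `variant_effect_predictor.pl`), and depending on the VEP version HGVS input may also need a database connection or a FASTA file, so check the VEP docs before relying on this exact flag set.

```python
import subprocess


def build_vep_command(hgvs_path, vcf_path, vep_executable="vep"):
    """Build the argument list for a VEP run that reads HGVS and writes VCF.

    `vep_executable` is an assumption; point it at whatever script your
    VEP install provides.
    """
    return [
        vep_executable,
        "--input_file", hgvs_path,   # one HGVS notation per line
        "--format", "hgvs",          # tell VEP the input is HGVS
        "--output_file", vcf_path,
        "--vcf",                     # write VCF instead of the default output
        "--refseq",                  # use the RefSeq cache
        "--cache",
        "--force_overwrite",
    ]


def run_vep(hgvs_path, vcf_path, vep_executable="vep"):
    """Run VEP and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_vep_command(hgvs_path, vcf_path, vep_executable), check=True)


command = build_vep_command("lovd_hgvs.txt", "lovd_remapped.vcf")
print(" ".join(command))
```

Building the argument list separately from running it makes it easy to log or inspect the exact command before anything is executed.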

At first glance it appears to be able to handle more intricate notations and more refseq transcripts than the current approach. This is also simple to distribute as a technique because EVERYONE has access to VEP. This strengthens the case for migrating to the standard VEP annotation format rather than Monkol's hacked version.

More testing is required, but I think this is easily the best option moving forward. Additional information about frameshift variants, etc. could be written as a VEP plugin rather than a stand-alone Perl script and simply be incorporated into the annotation phase.

Note that invalid notations are simply dropped by VEP. To keep the original LOVD data, we could write out a file containing only the HGVS notation (in case extra columns are not allowed), annotate it, and then use pandas to join the annotations back onto the LOVD data with the HGVS notation as the key, or something along those lines.
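A rough sketch of that round trip with pandas. The column names (`hgvs`, `pathogenicity`, `chrom`, etc.), the helper names, and the placeholder HGVS strings are all made up for illustration, and it assumes the original HGVS string can be recovered from the annotated output (for example from the VCF ID column, which would need verifying). A left join keeps every LOVD row, so notations that VEP dropped stay visible instead of silently disappearing.

```python
import pandas as pd


def write_hgvs_only(lovd, path, hgvs_column="hgvs"):
    """Write one unique HGVS notation per line, with no header, as VEP input."""
    lovd[hgvs_column].drop_duplicates().to_csv(path, index=False, header=False)


def join_annotations(lovd, annotated, hgvs_column="hgvs"):
    """Left-join annotations onto LOVD rows and flag which rows were remapped."""
    merged = lovd.merge(annotated, on=hgvs_column, how="left", indicator=True)
    merged["remapped"] = merged["_merge"] == "both"
    return merged.drop(columns="_merge")


# Placeholder data: the accession and variants are not real LOVD entries.
lovd = pd.DataFrame({
    "hgvs": ["NM_000000.0:c.1A>G", "NM_000000.0:c.2del", "not-valid-hgvs"],
    "pathogenicity": ["pathogenic", "unknown", "unknown"],
})

# Stand-in for what would be parsed back out of the VEP VCF output.
annotated = pd.DataFrame({
    "hgvs": ["NM_000000.0:c.1A>G", "NM_000000.0:c.2del"],
    "chrom": ["1", "1"],
    "pos": [100, 101],
    "ref": ["A", "AT"],
    "alt": ["G", "A"],
})

merged = join_annotations(lovd, annotated)
print(merged[["hgvs", "pathogenicity", "remapped"]])
```

Rows with `remapped == False` are the ones VEP rejected, which gives a ready-made list of notations to inspect by hand.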

Basic workflow would be:

  1. Download data from LOVD (downloader released as a PyPI package that points to potential downstream processing)
  2. Annotate with VEP, specifying VCF format output and any plugins
  3. Join annotation and LOVD data and output new VCF with all fields
  4. Validate to output final VCFs with ONLY validated variants
  5. Additional annotations (26K, HGMD, etc.) can be added later as needed. That part is up to the end user.
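Steps 3 and 4 could be sketched roughly as below: fold the LOVD fields into the VCF INFO column and emit only records flagged as validated. The record layout, the `validated` flag, and the `LOVD_CLASS` key are hypothetical, and how validation is actually decided is not covered here. Percent-encoding is used as one way to keep spaces, semicolons, and equals signs out of INFO values.

```python
from urllib.parse import quote


def encode_info(fields):
    """Encode a dict as a VCF INFO string, percent-encoding reserved characters."""
    if not fields:
        return "."
    return ";".join(
        f"{key}={quote(str(value), safe='')}" for key, value in fields.items()
    )


def to_vcf_line(record):
    """Format one record as a tab-separated VCF data line (no sample columns)."""
    columns = [
        str(record["chrom"]),
        str(record["pos"]),
        record.get("id", "."),
        record["ref"],
        record["alt"],
        ".",  # QUAL
        ".",  # FILTER
        encode_info(record.get("info", {})),
    ]
    return "\t".join(columns)


def validated_lines(records):
    """Keep only records whose hypothetical `validated` flag is true."""
    return [to_vcf_line(r) for r in records if r.get("validated")]


# Placeholder records standing in for the joined annotation + LOVD data.
records = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "validated": True,
     "info": {"LOVD_CLASS": "likely pathogenic"}},
    {"chrom": "1", "pos": 101, "ref": "AT", "alt": "A", "validated": False,
     "info": {"LOVD_CLASS": "unknown; see notes"}},
]

lines = validated_lines(records)
for line in lines:
    print(line)
```

A real output file would also need the `##fileformat` header, `##INFO` definitions for each LOVD key, and the `#CHROM` column line, which are left out of this sketch.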

Implemented. Will either commit to this repo or start a new one for validation only.