Several avenues for implementation were considered:
- try and annotate variant consequences myself by investigating effects on transcription and translation.
- avoid the weeds of genomic to tx projections and codon tables by using the hgvs package.
This solution implements the last option and relies on the hgvs package.
Requires Python >= 3.7.
$ python setup.py install
Performance of hgvs can be greatly improved by installing a local instance of seqrepo (takes ~20 min):
$ pip install seqrepo
$ seqrepo init
$ seqrepo pull -i 2019-06-20
If seqrepo is installed, point hgvs at the local instance:
$ export HGVS_SEQREPO_DIR=/usr/local/share/seqrepo/2019-06-20
Example call used to obtain annotations.csv:
$ python -m tempus annotate test/data/Challenge_data.vcf annotations.csv --workers 5 --chunk-size 1000
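The --workers and --chunk-size flags in the call above suggest a chunk-and-dispatch pattern. The following is only a minimal sketch of that pattern, not the code in this package: chunked, annotate_chunk and annotate_parallel are hypothetical names, and the per-chunk work is a placeholder that merely parses VCF data lines.

```python
from itertools import islice
from multiprocessing import Pool


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def annotate_chunk(lines):
    """Hypothetical stand-in for the real per-chunk annotation work.

    Here it only parses CHROM, POS, REF and ALT from tab-separated
    VCF data lines.
    """
    rows = []
    for line in lines:
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        rows.append((chrom, int(pos), ref, alt))
    return rows


def annotate_parallel(lines, workers=5, chunk_size=1000):
    """Annotate data lines in separate processes, preserving input order."""
    with Pool(processes=workers) as pool:
        # imap yields per-chunk results in the order the chunks were submitted.
        results = pool.imap(annotate_chunk, chunked(lines, chunk_size))
        return [row for chunk in results for row in chunk]


if __name__ == "__main__":
    demo = ["1\t{}\t.\tA\tG".format(100 + i) for i in range(10)]
    print(annotate_parallel(demo, workers=2, chunk_size=3)[:2])
```

Each worker process would need its own hgvs/seqrepo handles, which sidesteps thread-safety questions at the cost of per-process start-up time.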
A few of my questions and concerns regarding the challenge statement:
table vs annotated VCF:
[..] output a table annotating each variant in the file. upload [..] along with the annotated VCF file [..]
I am assuming that a CSV table should suffice.
type of variation:
(Substitution, Insertion, Silent, Intergenic, etc.) [..] annotate with the most deleterious possibility
This list is likely intentionally vague. Genetics is somewhat mixed with genomics here: variant type (substitution, insertion) and functional effect (silent, intergenic) can be orthogonal. I will make some safe assumptions and implement only some of the basic functional nomenclature.
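One way to keep the two axes apart is to derive the variant type from the alleles alone and to pick the "most deleterious possibility" from a ranked list of functional labels. The sketch below is illustrative only: the labels, their order, and the function names are assumptions, not the nomenclature this package emits.

```python
# Hypothetical severity ranking, most deleterious first.
# Both the labels and their order are assumptions made for illustration.
SEVERITY = [
    "frameshift",
    "nonsense",
    "missense",
    "inframe_indel",
    "silent",
    "intronic",
    "intergenic",
]
RANK = {label: i for i, label in enumerate(SEVERITY)}


def variant_type(ref, alt):
    """Classify a variant by allele lengths only (VCF-style REF/ALT).

    Alleles of unequal length are classified by their net length change,
    which is a simplification for complex delins events.
    """
    if len(ref) == len(alt):
        return "substitution"
    return "insertion" if len(alt) > len(ref) else "deletion"


def most_deleterious(consequences):
    """Return the most severe label; unknown labels rank last."""
    if not consequences:
        raise ValueError("no consequences to choose from")
    return min(consequences, key=lambda c: RANK.get(c, len(SEVERITY)))


# One variant, several transcripts, several predicted effects.
effects = ["silent", "missense", "intronic"]
print(variant_type("A", "G"), most_deleterious(effects))
```

With this split a substitution can be silent, missense or intergenic without the two columns contradicting each other.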
- Parallelism via splitting the VCF into chunks and letting separate processes chug away is rather inelegant.
- Threads or coroutines would make more sense for the network-bound I/O characteristic of this annotator. But seqrepo is not thread-safe, and I didn't feel like digging too deep into the issues there. Coroutines could certainly be considered, but again I just wanted a quick and dirty solution that keeps the library APIs maximally transparent.
- Some extra effort went into 5'-normalizing (left shuffle) variants to query ExAC.
- But the gains from such normalization were never assessed and could well be negligible.
- PyVCF is a bit rough around the edges.
- For example, pickling _Record(s) (objects that represent loci) destroys some of the INFO and GT data inside. Writing is also no joke. Perhaps this is why a CSV table is good enough for now :)
- Jupyter notebooks found in nb/ are merely scratch paper.
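For the left shuffle mentioned in the notes above, here is a minimal sketch of 5' normalization on a plain string. It is not the routine used in this package: left_normalize is a hypothetical name, the reference is passed in as an in-memory str rather than fetched from seqrepo, and positions are assumed to be 1-based as in VCF. HGVS nomenclature shifts variants toward the 3' end, while VCF-based resources conventionally left-align, which is why such a step may be needed before a lookup.

```python
def left_normalize(pos, ref, alt, reference):
    """Left-align and trim a variant against `reference`.

    `pos` is 1-based. Returns the normalized (pos, ref, alt).
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    if reference[pos - 1:pos - 1 + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")

    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot shift past the start of the reference")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True

    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# A two-base deletion inside a CA repeat, reported at its 3'-most position.
print(left_normalize(4, "CAC", "C", "GCACACT"))
```

Whether this step changes the hit rate against ExAC is exactly the gain that was never assessed.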