The relevant data is read from test_vcf_data.txt.
For each variant, an ID is built in HGVS sequence variant nomenclature format, and the IDs are submitted to the VEP HGVS API in query blocks of 300.
Data is retrieved from the API in JSON format, processed, and then written to the output file, variants.tsv, which contains the following fields:
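
The ID construction and batching described above might look roughly like the following. This is a minimal sketch, not the script's actual code: the helper names (`hgvs_id`, `batches`, `request_body`) are hypothetical, only simple single-base substitutions are handled, and the `hgvs_notations` request key is an assumption about the JSON body the VEP HGVS POST endpoint expects. No network call is made.

```python
import json

BATCH_SIZE = 300  # query block size stated in this README


def hgvs_id(chrom, pos, ref, alt):
    """Build a genomic HGVS string for a single-base substitution.

    Hypothetical helper: insertions and deletions need a different
    HGVS form and are not handled in this sketch.
    """
    return f"{chrom}:g.{pos}{ref}>{alt}"


def batches(items, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def request_body(ids):
    """Serialise one block of IDs as a JSON request body.

    The "hgvs_notations" key is an assumption about the API's
    expected payload, not something stated in this README.
    """
    return json.dumps({"hgvs_notations": ids})


# Made-up variants, purely to show the batching behaviour.
ids = [hgvs_id("1", 1000 + i, "A", "G") for i in range(650)]
blocks = list(batches(ids))
print([len(block) for block in blocks])  # [300, 300, 50]
```

Each block would then be sent as one request, so 650 variants need three queries rather than 650.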
- chromosome
- position
- reference allele (humanG1Kv37)
- alternative allele
- depth of sequence coverage at the site of variation
- number of reads supporting the variant
- percentage of reads supporting the variant versus reads supporting the reference
- gene of the variant
- variant class
- variant effect
- minor allele frequency
Note that loci with multiple alternative alleles are broken up into separate rows.
Unavailable data is denoted NA.
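
As an illustration of the row splitting and NA convention, here is a short sketch under stated assumptions: the function name `split_alleles` is hypothetical, alternative alleles and their supporting read counts are assumed to arrive as comma-separated strings, and only a subset of the output columns is shown.

```python
def split_alleles(chrom, pos, ref, alts, alt_reads):
    """Return one row per alternative allele.

    Hypothetical helper: `alts` and `alt_reads` are assumed to be
    comma-separated strings with one entry per alternative allele.
    A missing, empty, or "." read count is written as "NA".
    """
    alt_list = alts.split(",")
    read_list = alt_reads.split(",") if alt_reads else []
    rows = []
    for i, alt in enumerate(alt_list):
        reads = read_list[i] if i < len(read_list) else ""
        if reads in ("", "."):
            reads = "NA"
        rows.append([chrom, str(pos), ref, alt, reads])
    return rows


# Made-up locus with two alternative alleles; the second has no read count.
rows = split_alleles("1", 1000, "A", "G,T", "12")
for row in rows:
    print("\t".join(row))
```

The locus above produces two tab-separated rows, one for G with 12 supporting reads and one for T with NA.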
CLI run command:
python3 tempus_challenge.py