aminos: A Python repository from barneyhill

Overview

Convert large genetic VCF files into FASTA files containing each individual's protein sequences. aminos handles complex combinations of coding variants and scales to biobank-size genetic sequencing datasets.

Install

git clone https://github.com/barneyhill/aminos
cd aminos
pip install -r requirements.txt

Usage

python3 aminos.py --vcf [path-to-vcf] --gff [path-to-gff] --fasta [path-to-fasta] --output [output-directory]

Input Requirements

A GFF3 file containing genomic features. Currently, only Ensembl GFF3 files are supported; see, for example, ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens.
A VCF file containing phased variant calls. The VCF file should be generated by BCFtools/csq.
A reference FASTA file containing transcript IDs and the protein sequence of each transcript, for example:

>TRANS_ID
TRANS_SEQ_LINE1
TRANS_SEQ_LINE2 
>TRANS_ID
TRANS_SEQ_LINE1
.
.
.
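
To check that a reference file matches the layout above before running aminos, it can be loaded into a dictionary keyed by transcript ID. The following is a minimal sketch, not part of aminos: the function name read_fasta and the example IDs are hypothetical, and it assumes the transcript ID is the first whitespace-delimited token on each header line.

```python
import io


def read_fasta(handle):
    """Yield (transcript_id, sequence) pairs from an open FASTA text stream.

    Hypothetical helper for illustration; not the parser used by aminos.
    """
    transcript_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if transcript_id is not None:
                yield transcript_id, "".join(chunks)
            # Assumption: the ID is the first token after ">".
            transcript_id, chunks = line[1:].split()[0], []
        else:
            # Sequence lines of one record are concatenated.
            chunks.append(line)
    if transcript_id is not None:
        yield transcript_id, "".join(chunks)


# Made-up example data in the layout shown above.
example = io.StringIO(">ENST0001\nMKT\nAYI\n>ENST0002\nMSG\n")
reference = dict(read_fasta(example))
```

For a file on disk, the same function can be called on the handle returned by open(path).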

Output

aminos writes one file per transcript to [output-directory]/[transcript].fa.gz. Each file contains the variant sequences for that transcript, each associated with a comma-separated list of the samples that carry it. Samples are formatted as {individual_id}_{haplotype}.
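
The output files can be read back with the Python standard library. The sketch below rests on an assumption that should be checked against a real output file: that each record's header line is ">" followed directly by the comma-separated sample list. The function names parse_sample and read_variant_fasta are hypothetical and are not provided by aminos.

```python
import gzip


def parse_sample(label):
    """Split '{individual_id}_{haplotype}' on the last underscore,
    so individual IDs that contain underscores stay intact."""
    individual_id, haplotype = label.rsplit("_", 1)
    return individual_id, haplotype


def read_variant_fasta(path):
    """Return a list of (samples, sequence) records from a gzipped FASTA file.

    Assumption: the header line is '>' plus a comma-separated sample list.
    """
    records = []
    samples, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if samples is not None:
                    records.append((samples, "".join(chunks)))
                samples = [parse_sample(s) for s in line[1:].split(",") if s]
                chunks = []
            elif line:
                chunks.append(line)
    if samples is not None:
        records.append((samples, "".join(chunks)))
    return records
```

Each returned record pairs a list of (individual_id, haplotype) tuples with the protein sequence they share.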

Currently supported consequences

missense
inframe_deletion
inframe_insertion
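
All three consequence types can be viewed as replacing a stretch of reference amino acids with an alternative stretch at a given position. The sketch below illustrates that idea only; it is not the aminos implementation. The function names, the (position, ref, alt) tuple layout, and the 1-based coordinate convention are assumptions made for the example, and the changes passed to apply_all are assumed not to overlap.

```python
def apply_protein_change(sequence, position, ref, alt):
    """Replace `ref` with `alt` at a 1-based protein position.

    missense:          len(ref) == len(alt)
    inframe_deletion:  len(ref) >  len(alt)
    inframe_insertion: len(ref) <  len(alt)
    """
    start = position - 1
    observed = sequence[start:start + len(ref)]
    if observed != ref:
        raise ValueError(
            f"expected {ref!r} at position {position}, found {observed!r}"
        )
    return sequence[:start] + alt + sequence[start + len(ref):]


def apply_all(sequence, changes):
    """Apply several non-overlapping changes to one haplotype.

    Working from the highest position down means earlier edits never
    shift the coordinates of the changes still to be applied.
    """
    for position, ref, alt in sorted(changes, reverse=True):
        sequence = apply_protein_change(sequence, position, ref, alt)
    return sequence


protein = "MKTAYIAK"  # made-up reference sequence
missense = apply_protein_change(protein, 3, "T", "S")
deletion = apply_protein_change(protein, 4, "AY", "A")
insertion = apply_protein_change(protein, 4, "A", "AG")
combined = apply_all(protein, [(3, "T", "S"), (4, "AY", "A")])
```

Checking the reference residues before editing catches a mismatch between the annotation and the reference FASTA early, rather than silently producing a wrong sequence.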

Contact

barney.hill@ndph.ox.ac.uk