Convert large genetic VCF files into FASTA files containing each individual's protein sequences. This repo handles complex combinations of coding variants and scales to biobank-size genetic sequencing datasets.
git clone https://github.com/barneyhill/aminos
cd aminos
pip install -r requirements.txt
python3 aminos.py --vcf [path-to-vcf] --gff [path-to-gff] --fasta [path-to-fasta] --output [output-directory]
- A GFF3 file containing genomic features. Currently, only Ensembl GFF3 files are supported; see, for example, ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens.
- A VCF file containing phased variant calls. The VCF file should be generated by BCFtools/csq.
- A reference FASTA file containing transcript IDs and the protein sequence of each transcript, for example:
>TRANS_ID
TRANS_SEQ_LINE1
TRANS_SEQ_LINE2
>TRANS_ID
TRANS_SEQ_LINE1
.
.
.
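To sanity-check a reference FASTA before running aminos, the layout above can be read into a dictionary. This is a minimal sketch, not code from this repo: `read_reference_fasta` is a hypothetical helper, and it assumes the transcript ID is the first whitespace-delimited token after `>` and that wrapped sequence lines should simply be concatenated.

```python
from typing import Dict, Iterable, List


def read_reference_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each transcript ID to its protein sequence, joining wrapped lines."""
    chunks: Dict[str, List[str]] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Assumption: the ID is the first token of the header line.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            chunks[current_id] = []
        elif current_id is not None:
            chunks[current_id].append(line)
    return {tid: "".join(parts) for tid, parts in chunks.items()}


example = [">TRANS_A\n", "MKV\n", "LLA\n", ">TRANS_B\n", "MA\n"]
reference = read_reference_fasta(example)
```

Passing an open file handle instead of a list works the same way, since the function only iterates over lines.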
aminos will write to [output-directory]/[transcript].fa.gz. Each file contains the variant sequences for that transcript, each paired with a comma-separated list of the samples that carry it. Samples are labeled {individual_id}_{haplotype}.
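For downstream analysis it is often convenient to look up the sequence carried by a given sample. The sketch below is one way to do that under an assumed file layout: each record is a `>` header holding the comma-separated sample labels, followed by one or more sequence lines. That layout, and the helper names `read_sample_sequences` and `split_sample_label`, are assumptions for illustration; check a real output file before relying on them.

```python
import gzip
from typing import Dict, List, Tuple


def _assign(by_sample: Dict[str, str], samples: List[str], parts: List[str]) -> None:
    """Store the finished sequence under every sample label of the record."""
    sequence = "".join(parts)
    for sample in samples:
        by_sample[sample] = sequence


def read_sample_sequences(path: str) -> Dict[str, str]:
    """Return {sample_label: protein_sequence} for one [transcript].fa.gz file."""
    by_sample: Dict[str, str] = {}
    samples: List[str] = []
    parts: List[str] = []
    with gzip.open(path, "rt") as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue
            if line.startswith(">"):
                _assign(by_sample, samples, parts)  # close the previous record
                # Assumption: the header is a comma-separated list of samples.
                samples = [s.strip() for s in line[1:].split(",") if s.strip()]
                parts = []
            else:
                parts.append(line)
    _assign(by_sample, samples, parts)  # close the final record
    return by_sample


def split_sample_label(label: str) -> Tuple[str, str]:
    """Split '{individual_id}_{haplotype}' on the last underscore."""
    individual_id, haplotype = label.rsplit("_", 1)
    return individual_id, haplotype
```

Splitting on the last underscore keeps individual IDs that themselves contain underscores intact.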
- missense
- inframe_deletion
- inframe_insertion
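To see how many annotations in a VCF fall into the consequence types listed above, the `BCSQ` INFO value written by BCFtools/csq can be filtered. This is a hedged sketch rather than aminos's own logic: it assumes the pipe-separated BCSQ layout from the bcftools documentation, with the consequence as the first field and compound consequences joined by `&`. The name `supported_consequences` and the example annotation are hypothetical.

```python
from typing import List

# The three consequence types named in the list above.
SUPPORTED = {"missense", "inframe_deletion", "inframe_insertion"}


def supported_consequences(bcsq: str) -> List[str]:
    """Return the BCSQ entries whose consequence terms are all in SUPPORTED."""
    kept = []
    for entry in bcsq.split(","):
        if not entry or entry.startswith("@"):
            # '@<position>' entries refer to another record's annotation.
            continue
        consequence = entry.split("|", 1)[0]
        # A compound consequence such as 'a&b' must be fully supported.
        if all(term in SUPPORTED for term in consequence.split("&")):
            kept.append(entry)
    return kept


# Made-up annotation string for illustration only.
example_bcsq = (
    "missense|GENE1|TRANS_A|protein_coding|+|23P>23L|1234C>T,"
    "synonymous|GENE1|TRANS_B|protein_coding|+|23P|1234C>T"
)
kept_entries = supported_consequences(example_bcsq)
```

Entries prefixed with `*` (consequences downstream of a stop) are not in `SUPPORTED` and are therefore dropped by this filter; whether aminos treats them the same way is not stated here.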
barney.hill@ndph.ox.ac.uk