ProHap & ProVar

Proteogenomics database-generation tool for protein haplotypes and variants

Input & Usage

Required ingredients:

GTF annotation file (Ensembl - downloaded automatically by Snakemake)
cDNA FASTA file (Ensembl - downloaded automatically by Snakemake)
(optional) ncRNA FASTA file (Ensembl - downloaded automatically by Snakemake)
ProHap: VCF with phased genotpyes, one file per chromosome (such as 1000 Genomes Project - downloaded automatically by Snakemake)
ProVar: VCF, single file per dataset

Required software: Snakemake & Conda

Usage:

Create a configuration file called config.yaml based on the instructions in config_file_example
Test Snakemake with a dry-run: snakemake -c<# provided cores> -n -q
Run the Snakemake pipeline to create your protein database: snakemake -c<# provided cores> -p --use-conda

Analyzing the peptide-spectrum matches

Once you obtain a list of peptide-spectrum matches (PSMs), you can use a script provided in this repository (peptides_annotate_variation.py) to map the peptides back to the respective protein haplotype / variant sequences, and map the identified variants back to their genetic origin. An example of the required input file (tab-separated):

PSMId Sequence Proteins Positions
psm_1 ALPCGHCPEEWITYSNSCYYIGK prot_f5d7;prot_22054;prot_3712e;prot_3ac94;prot_4055f 110;110;110;110;110
psm_2 LGCVLMAWALYLSLGVLWVAQMLLELFPAPILR prot_b40;prot_272f;prot_27a25;prot_2c28b;prot_2dac8 30;2;19;2;30
psm_3 ISVGVAGDLNTVTMK prot_b40;prot_3c88;prot_9b74;prot_9cf6;prot_b882 15;4;15;4;4

The annotation script can be used with the following parameters:

python src/analysis/peptides_annotate_variation.py 
    -i <input file of PSMs> 
    -hap <table of haplotypes produced by ProHap, if haplotypes are included in the search database> 
    -var <table of variants produced by ProVar, if individual variants are included in the search database> 
    -f <optimized FASTA file provided by ProHap/ProVar> 
    -tr_id data/protein_transcript_ids_108.csv <protein and transcript ID mapping - provided in this repository for Ensembl v.108> 
    -g_id data/gene_transcript_ids_108.csv <transcript and gene ID mapping - provided in this repository for Ensembl v.108> 
    -t <# threads> 
    -o <output filename> 
    -log <log file>

Output

FASTA protein database

The protein sequences are first split into sub-sequences by start and stop codon positions, and then duplicate sequences are aggregated into one FASTA entry. The resulting file has the following format:

>tag|accession|<positions_within_protein> <protein_IDs> <protein_starts> <matching_proteins> <reading_frames>
PROTEINSEQUENCE

Possible tag values are:

generic_cont: At least one of the matching sequences belongs to a contaminant.
generic_ref: No matching contaminant, at least one of the matching sequences belongs to a canonical protein.
generic_var: No matching contaminant or canonical protein, at least one of the matching sequences belongs to a variant protein.
generic_hap: No matching contaminant, canonical or variant protein, all of the matching sequences belong to a non-canonical protein haplotype.

The tag values for haplotypes and variants are customizable in the config file.

The fields included in the description of the FASTA elements are the following:

positions_within_protein: position of matching sub-sequences within the whole protein sequence, delimited by semicolon
protein_IDs: IDs of the sub-sequences after splitting the whole protein (redundant)
protein_starts: positions of the start residue (usually M) within the whole protein, if known (0 otherwise)
matching_proteins: IDs of the whole protein sequences matching to this sub-sequence. Variant and haplotype IDs can be mapped to the metadata table provided.
reading_frames: Reading frames in which the matching proteins are translated, if known.

Metadata table

Metadata file provided in a tab-separated text-file format. The columns are: