Proteogenomics database-generation tool for protein haplotypes and variants
Required ingredients:
- GTF annotation file (Ensembl - downloaded automatically by Snakemake)
- cDNA FASTA file (Ensembl - downloaded automatically by Snakemake)
- (optional) ncRNA FASTA file (Ensembl - downloaded automatically by Snakemake)
- ProHap: VCF with phased genotpyes, one file per chromosome (such as 1000 Genomes Project - downloaded automatically by Snakemake)
- ProVar: VCF, single file per dataset
Required software: Snakemake & Conda
Usage:
- Create a configuration file called
config.yaml
based on the instructions inconfig_file_example
- Test Snakemake with a dry-run:
snakemake -c<# provided cores> -n -q
- Run the Snakemake pipeline to create your protein database:
snakemake -c<# provided cores> -p --use-conda
Once you obtain a list of peptide-spectrum matches (PSMs), you can use a script provided in this repository (peptides_annotate_variation.py) to map the peptides back to the respective protein haplotype / variant sequences, and map the identified variants back to their genetic origin. An example of the required input file (tab-separated):
PSMId Sequence Proteins Positions
psm_1 ALPCGHCPEEWITYSNSCYYIGK prot_f5d7;prot_22054;prot_3712e;prot_3ac94;prot_4055f 110;110;110;110;110
psm_2 LGCVLMAWALYLSLGVLWVAQMLLELFPAPILR prot_b40;prot_272f;prot_27a25;prot_2c28b;prot_2dac8 30;2;19;2;30
psm_3 ISVGVAGDLNTVTMK prot_b40;prot_3c88;prot_9b74;prot_9cf6;prot_b882 15;4;15;4;4
The annotation script can be used with the following parameters:
python src/analysis/peptides_annotate_variation.py
-i <input file of PSMs>
-hap <table of haplotypes produced by ProHap, if haplotypes are included in the search database>
-var <table of variants produced by ProVar, if individual variants are included in the search database>
-f <optimized FASTA file provided by ProHap/ProVar>
-tr_id data/protein_transcript_ids_108.csv <protein and transcript ID mapping - provided in this repository for Ensembl v.108>
-g_id data/gene_transcript_ids_108.csv <transcript and gene ID mapping - provided in this repository for Ensembl v.108>
-t <# threads>
-o <output filename>
-log <log file>
The protein sequences are first split into sub-sequences by start and stop codon positions, and then duplicate sequences are aggregated into one FASTA entry. The resulting file has the following format:
>tag|accession|<positions_within_protein> <protein_IDs> <protein_starts> <matching_proteins> <reading_frames>
PROTEINSEQUENCE
Possible tag values are:
generic_cont
: At least one of the matching sequences belongs to a contaminant.generic_ref
: No matching contaminant, at least one of the matching sequences belongs to a canonical protein.generic_var
: No matching contaminant or canonical protein, at least one of the matching sequences belongs to a variant protein.generic_hap
: No matching contaminant, canonical or variant protein, all of the matching sequences belong to a non-canonical protein haplotype.
The tag values for haplotypes and variants are customizable in the config file.
The fields included in the description of the FASTA elements are the following:
positions_within_protein
: position of matching sub-sequences within the whole protein sequence, delimited by semicolonprotein_IDs
: IDs of the sub-sequences after splitting the whole protein (redundant)protein_starts
: positions of the start residue (usually M) within the whole protein, if known (0 otherwise)matching_proteins
: IDs of the whole protein sequences matching to this sub-sequence. Variant and haplotype IDs can be mapped to the metadata table provided.reading_frames
: Reading frames in which the matching proteins are translated, if known.
Metadata file provided in a tab-separated text-file format. The columns are:
chromosome
TranscriptID
transcript_biotype
: Biotype of the matching transcript in Ensembl.HaplotypeID
: ID of the haplotype sequence, matching to the ID in the FASTA entry description.VCF_IDs
: IDs of the matching lines in the VCF file of providedDNA_changes
: List of changes in the format POS:REF>ALT, mapped to the DNA coordinates within the chromosomeallele_frequencies
: List of allele vrequencies of the variants in cluded in this haplotypecDNA_changes
: List of changes in the format POS:REF>ALT, mapped to the coordinates within the cDNA of this transcriptall_protein_changes
: List of changes in the format POS:REF>ALT, mapped to the coordinates within the protein sequence. The start codon is at position 0, so if a change happens in the 5' untranslated region (UTR), its coordinates within the protein are negative.protein_changes
: List of changes in the protein excluding synonymous mutations.reading_frame
: Canonical reading frame for this transcript, if known.protein_prefix_length
: Number of codons in the 5' UTRsplice_sites_affected
: List of splice sites affected by a mutation, if any. (Splice site 0 happens between exon 1 and 2)occurrence_count
: Number of occurrences of this haplotype within the participants of the 1000 Genomes project (or within the individuals provided in the phased genotype VCF)frequency
: Frequency of this haplotype within the participants of the 1000 Genomes project (or within the individuals provided in the phased genotype VCF)