PrecisionProDB is a Python package for proteogenomics, which can generate a customized protein database for peptide search in mass spectrometry.
The Genome Aggregation Database (gnomAD) project provides variant allele frequencies in different human populations based on genomes and exomes of hundreds of thousands of individuals. The population-specific common allele information can be integrated into a protein database. We applied PrecisionProDB to alleles from different populations in the gnomAD (v3.1) data and provide the pre-calculated protein databases here.
Several population groups are included:
population | abbreviation | genomes | # alt > ref |
---|---|---|---|
African/African American | adj_afr | 20,744 | 2,334,582 |
Amish | adj_ami | 456 | 2,435,528 |
Latino/Admixed American | adj_amr | 7,647 | 2,343,063 |
Ashkenazi Jewish | adj_asj | 1,736 | 2,375,907 |
East Asian | adj_eas | 2,604 | 2,579,504 |
Finnish | adj_fin | 5,316 | 2,364,824 |
Middle Eastern | adj_mid | 158 | 2,346,917 |
Non-Finnish European | adj_nfe | 34,029 | 2,369,937 |
South Asian | adj_sas | 2,419 | 2,367,008 |
Other | adj_oth | 1,047 | - |
Total | adj | 76,156 | 2,274,088 |
weblink: https://gnomad.broadinstitute.org/faq
`# alt > ref`: count of sites at which the alternative allele has a higher allele frequency (AF) than the allele in the reference genome.
Variants are from gnomAD 3.1. Only sites that are "PASS" in quality control and whose `alt` allele frequency is higher than that of `ref` are included.
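The selection rule above can be expressed as a small predicate. This is a minimal sketch, not the code used to build the files: the `Site` record, the `keep_site` function, and the `"FAIL"` status value are hypothetical names chosen for illustration, and the second and third example records are made up.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """One variant site (hypothetical record layout for illustration)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    filter_status: str  # "PASS" or any other quality-control label
    alt_af: float
    ref_af: float


def keep_site(site: Site) -> bool:
    """Keep a site only if it passed QC and alt is more frequent than ref."""
    return site.filter_status == "PASS" and site.alt_af > site.ref_af


sites = [
    Site("1", 10146, "AC", "A", "PASS", 0.6328, 0.3672),  # row from the example table
    Site("1", 20000, "G", "T", "PASS", 0.2000, 0.8000),   # made up: ref is more common
    Site("1", 30000, "C", "T", "FAIL", 0.9000, 0.1000),   # made up: did not pass QC
]
kept = [s for s in sites if keep_site(s)]
print([s.pos for s in kept])
```

Only the first record satisfies both conditions, so it is the only one retained.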
File names: `POPULATION_ABREVIATION.csv.gz`
Files are in CSV format (with `\t` as the separator) and look like:
chr | pos | ref | alt | alt_AF | ref_AF |
---|---|---|---|---|---|
1 | 10146 | AC | A | 0.6328 | 0.3672 |
1 | 15274 | A | G | 0.6311 | 0.0635 |
1 | 28563 | A | G | 0.6842 | 0.3158 |
1 | 49298 | T | C | 0.5852 | 0.4148 |
1 | 52238 | T | G | 0.9013 | 0.0987 |
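Files in this layout can be read with the Python standard library alone. The sketch below assumes that each file starts with a header row using the column names shown in the table above; the function name `read_variants` is illustrative and not part of PrecisionProDB.

```python
import csv
import gzip


def read_variants(path):
    """Yield one dict per variant from a gzip-compressed, tab-separated file.

    Assumes a header row with the columns chr, pos, ref, alt, alt_AF, ref_AF.
    """
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            yield {
                "chr": row["chr"],
                "pos": int(row["pos"]),
                "ref": row["ref"],
                "alt": row["alt"],
                "alt_AF": float(row["alt_AF"]),
                "ref_AF": float(row["ref_AF"]),
            }
```

Because the function is a generator, a large file can be scanned row by row without loading it into memory, for example `for variant in read_variants(path): ...` with `path` pointing to one of the downloaded files.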
GENCODE Release 35 (GRCh38.p13)
https://www.gencodegenes.org/human/release_35.html
GENCODE gene models with the most common alleles from gnomAD 3.1 across all individuals (adj).
Amino acid change mutations for different populations. For explanations of the columns, visit the wiki page of PREFIX.pergeno.aa_mutations.csv.
Combine changed proteins from 10 populations and keep only unique ones.
This file can be added to the current official protein models to improve protein database searches in mass spectrometry.
Protein names are the original name + '__' + populations. For example, in `ENSP00000390334.1|ENST00000453855.5|ENSG00000101019.22|OTTHUMG00000032335.13|OTTHUMT00000078865.3|UQCC1-213|UQCC1|126__adj|nfe_adj|fin_adj|mid_adj|ami_adj|asj_adj|eas_adj|amr_adj`, `ENSP00000390334.1|ENST00000453855.5|ENSG00000101019.22|OTTHUMG00000032335.13|OTTHUMT00000078865.3|UQCC1-213|UQCC1|126` is the original protein_id in GENCODE, and `adj|nfe_adj|fin_adj|mid_adj|ami_adj|asj_adj|eas_adj|amr_adj` are the populations with this altered protein.
If the population annotation is `ALLSHARE`, the altered protein exists in all ten populations.
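The naming convention can be undone with a few string operations. This is a minimal sketch based only on the description above; the function name `split_protein_name` and its return layout are illustrative assumptions and are not part of PrecisionProDB.

```python
def split_protein_name(name):
    """Split 'ORIGINAL_ID__pop1|pop2' into (original_id, populations, all_shared).

    all_shared is True when the annotation is ALLSHARE; the population list
    is then left empty because every population carries the altered protein.
    """
    if "__" not in name:
        # Unmodified protein: no population annotation attached.
        return name, [], False
    # Split from the right, because the original ID itself contains '|'.
    original_id, annotation = name.rsplit("__", 1)
    if annotation == "ALLSHARE":
        return original_id, [], True
    return original_id, annotation.split("|"), False


example = (
    "ENSP00000390334.1|ENST00000453855.5|ENSG00000101019.22|"
    "OTTHUMG00000032335.13|OTTHUMT00000078865.3|UQCC1-213|UQCC1|126"
    "__adj|nfe_adj|fin_adj|mid_adj|ami_adj|asj_adj|eas_adj|amr_adj"
)
original_id, populations, all_shared = split_protein_name(example)
print(original_id)
print(populations)
```

For the example name, the annotation lists eight populations, so `all_shared` is `False` and `populations` holds those eight labels in order.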
Total number of proteins: 101486, total number of AA: 38549462
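Totals of this kind can be recomputed from a protein FASTA file. The sketch below assumes that the numbers are counts of FASTA records and of residue characters on the sequence lines; the function name `fasta_totals` is hypothetical, and the short sequences in the example are made up.

```python
def fasta_totals(lines):
    """Return (number of proteins, number of amino acids) for FASTA lines."""
    n_proteins = 0
    n_aa = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_proteins += 1  # each header line starts a new protein
        else:
            n_aa += len(line)  # sequence lines may wrap over several lines
    return n_proteins, n_aa


demo = [">p1", "MKT", "AA", ">p2", "MV"]  # made-up records
print(fasta_totals(demo))
```

The function accepts any iterable of lines, so an open file handle can be passed directly.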
population | stopGain_Pr | stopLoss_Pr | frameChange_Pr | variant_AA | variant_Pr | indel_pr |
---|---|---|---|---|---|---|
adj | 90 | 44 | 156 | 14768 | 10728 | 404 |
afr_adj | 78 | 51 | 131 | 15697 | 11402 | 480 |
ami_adj | 101 | 50 | 157 | 15441 | 10993 | 432 |
amr_adj | 95 | 52 | 141 | 14769 | 10705 | 442 |
asj_adj | 90 | 44 | 158 | 15181 | 10830 | 419 |
eas_adj | 94 | 41 | 154 | 16524 | 11570 | 500 |
fin_adj | 92 | 47 | 151 | 15291 | 10881 | 439 |
mid_adj | 103 | 44 | 166 | 15215 | 10800 | 425 |
nfe_adj | 94 | 48 | 154 | 15134 | 10747 | 429 |
sas_adj | 84 | 46 | 147 | 15172 | 10917 | 419 |
- stopGain_Pr: count of proteins with stop-gain
- stopLoss_Pr: count of proteins with stop-loss
- frameChange_Pr: count of proteins with frame change
- variant_AA: total AA substitutions
- variant_Pr: count of proteins with AA substitutions
- indel_pr: count of proteins with insertion or deletion of AAs
RefSeq GCF_000001405.39
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/annotation_releases/current
Release version: 109.20200815
RefSeq gene models with the most common alleles from gnomAD 3.1 across all individuals (adj).
Amino acid change mutations for different populations.
Combine changed proteins from 10 populations and keep only unique ones.
Total number of proteins: 114963, total number of AA: 76187452
population | stopGain_Pr | stopLoss_Pr | frameChange_Pr | variant_AA | variant_Pr | indel_pr |
---|---|---|---|---|---|---|
adj | 104 | 0 | 174 | 28723 | 19298 | 632 |
afr_adj | 95 | 0 | 214 | 29885 | 20389 | 773 |
ami_adj | 134 | 15 | 224 | 29865 | 19645 | 697 |
amr_adj | 125 | 14 | 184 | 28724 | 19241 | 722 |
asj_adj | 117 | 0 | 176 | 29147 | 19083 | 681 |
eas_adj | 127 | 0 | 201 | 31791 | 20630 | 805 |
fin_adj | 113 | 0 | 186 | 29453 | 19233 | 692 |
mid_adj | 125 | 0 | 179 | 29496 | 19332 | 692 |
nfe_adj | 113 | 0 | 185 | 29291 | 19020 | 684 |
sas_adj | 112 | 0 | 165 | 29040 | 19428 | 659 |
Ensembl Homo_sapiens.GRCh38.101
Ensembl gene models with the most common alleles from gnomAD 3.1 across all individuals (adj).
Amino acid change mutations for different populations.
Combine changed proteins from 10 populations and keep only unique ones.
Total number of proteins: 112012, total number of AA: 42584212
population | stopGain_Pr | stopLoss_Pr | frameChange_Pr | variant_AA | variant_Pr | indel_pr |
---|---|---|---|---|---|---|
adj | 102 | 1 | 147 | 15136 | 10750 | 401 |
afr_adj | 89 | 1 | 126 | 16096 | 11426 | 476 |
ami_adj | 112 | 4 | 151 | 15844 | 11013 | 429 |
amr_adj | 109 | 5 | 133 | 15131 | 10724 | 441 |
asj_adj | 103 | 0 | 144 | 15664 | 10851 | 415 |
eas_adj | 111 | 1 | 143 | 16953 | 11587 | 499 |
fin_adj | 102 | 1 | 140 | 15706 | 10898 | 436 |
mid_adj | 115 | 0 | 154 | 15692 | 10820 | 421 |
nfe_adj | 107 | 0 | 144 | 15558 | 10767 | 425 |
sas_adj | 98 | 0 | 137 | 15398 | 10940 | 417 |
UniProt reference proteome, human (12/02/2020).
- ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640_9606.fasta.gz
- ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640_9606_additional.fasta.gz
UniProt gene models with the most common alleles from gnomAD 3.1 across all individuals (adj).
Combine changed proteins from 10 populations and keep only unique ones.
The percentage of changed proteins will be similar to that of Ensembl, as UniProt information was mostly extracted from Ensembl gene models.
UP000005640_9606: 20609 proteins, 11395683 AA.
UP000005640_9606_additional: 77157 proteins, 27708325 AA.
Total: 97766 proteins, 39104008 AA.