SM_Pfam-gnomAD-statistics

Repository containing notebooks to compute statistics in the paper "A unified approach to evolutionary conservation and population constraint in proteins".

A unified approach to evolutionary conservation and population constraint in proteins

Author: Stuart MacGowan (smacgowan@dundee.ac.uk)

Dataset

The analysis is based on aggregated statistics that we computed from data in the following databases:

  • Pfam-A database of protein families (version 31.0).
  • gnomAD database of human genetic variation (version 2.1.1).
  • ClinVar database of human genetic variants and their clinical significance.
  • PDBe database of protein structures.

These were processed into a single dataset of aggregated statistics for each Pfam domain, which is provided in data/pfam-gnomAD-clinvar-pdb-colstats_c7c3e19.csv.gz.
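As a quick way to inspect the dataset before opening the notebooks, the following is a minimal sketch that reads the header and the first few records using only the Python standard library. The function name `peek_colstats` and the `n_rows` parameter are hypothetical, and no column names are assumed; the notebooks themselves may load the file differently (for example with pandas).

```python
import csv
import gzip
from itertools import islice

# Path as given in this README, relative to the repository root.
DATA_PATH = "data/pfam-gnomAD-clinvar-pdb-colstats_c7c3e19.csv.gz"


def peek_colstats(path=DATA_PATH, n_rows=5):
    """Return (column_names, first n_rows records) from a gzipped CSV.

    Hypothetical helper for illustration; values are returned as strings
    because no assumptions are made about the column types.
    """
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(islice(reader, n_rows))
        return reader.fieldnames, rows


# Example usage (run from the repository root):
# columns, rows = peek_colstats()
# print(columns)
# print(rows[0])
```

Printing the column names first is a reasonable way to find out which statistics are available, since they are not listed here.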

Manuscript figures

The figures in the manuscript are generated by the notebooks in the figure folders under manuscript-figures.

  • Figure 1B: Frequency distribution of gnomAD missense variants across all amino acid residues in Pfam domains.
  • Figure 1C: Frequency distributions of gnomAD missense variants over alignment columns of Pfam domains.
  • Figure 1D: Total number of gnomAD missense or synonymous variants vs. the Shenkin diversity at each position across SH2 domains.
  • Figure 2A: Cumulative distributions of the normalised missense enrichment score or normalised Shenkin for positions where the consensus relative solvent accessibility class is core, partially exposed, or surface.
  • Figure 3A: The conservation plane: classifying residues in Pfam domains with evolutionary conservation and population constraint.
  • Figure 4A: Odds ratios of the enrichment of protein-ligand interacting residues from BioLiP within sites in different conservation plane categories.
  • Figure 4B: Protein-protein interaction (PPI) site enrichments.
  • Figure 4C: ClinVar Pathogenic site enrichments relative to the gnomAD missense background.
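
Figure 4A reports enrichment as odds ratios. For readers unfamiliar with the quantity, the sketch below computes an odds ratio and an approximate 95% confidence interval from a 2x2 table of site counts. It is a generic illustration, not necessarily the procedure used in the notebooks: the function name `odds_ratio`, the Woolf (log-scale) interval, the 0.5 continuity correction for empty cells, and the example counts are all assumptions made here.

```python
import math


def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and approximate confidence interval for a 2x2 table.

    a: sites in the category that have the feature (e.g. ligand-binding)
    b: sites in the category that lack the feature
    c: sites outside the category that have the feature
    d: sites outside the category that lack the feature

    Returns (odds_ratio, lower, upper). Hypothetical helper for illustration.
    """
    cells = [float(a), float(b), float(c), float(d)]
    # Assumption: add 0.5 to every cell if any cell is zero, so that the
    # ratio and its standard error remain finite.
    if min(cells) == 0:
        cells = [x + 0.5 for x in cells]
    a, b, c, d = cells

    ratio = (a * d) / (b * c)
    # Woolf's method: standard error of log(OR).
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, lower, upper


# Illustrative counts only; these are not values from the paper.
ratio, lower, upper = odds_ratio(20, 80, 10, 90)
print(f"OR = {ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An odds ratio above 1 indicates that the feature is over-represented in the category relative to the remaining sites, and an interval that excludes 1 suggests the enrichment is unlikely to be due to chance alone.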

Citation

Stuart A. MacGowan, Fábio Madeira, Thiago Britto-Borges et al. A unified approach to evolutionary conservation and population constraint in protein domains highlights structural features and pathogenic sites, 13 July 2023, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-3160340/v1]

License

This repository and its contents were created by Stuart A. MacGowan (@stuartmac) at the University of Dundee and are provided under the MIT license. See LICENSE for details.