/seq_struct_func

Example scripts for navigating protein sequence-structure-function space.


Guide to running the sequence-structure-function pipeline

Refer to our paper in Bioinformatics for more details.

Setup

Data: (si_data/) Data necessary for running examples

  • asr_seq_annotations.xlsx
    • All enzymes, sequences, and annotations from the structure-function pipeline
  • extant_msa.fasta
    • Multiple sequence alignment previously used for ancestral sequence reconstruction
  • fasta/
    • Sequences in asr_seq_annotations.xlsx written in FASTA format
  • pdb_with_fad/
    • Directory containing all AlphaFold2 models with FAD cofactor
  • top_dock_pose/
    • Directory containing the lowest-energy poses from minimization in explicit protein
  • log_reg_models/
    • Pretrained statsmodels logistic regression models
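
The FASTA files listed above (extant_msa.fasta and the files under fasta/) can be loaded with a few lines of plain Python. The following is a minimal sketch, assuming standard FASTA layout; the helper name `read_fasta` is illustrative and is not part of this repository.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} dictionary.

    The record id is the first whitespace-delimited token after '>'.
    Gap characters ('-') are kept, so aligned files stay aligned.
    """
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `read_fasta(open("si_data/extant_msa.fasta"))` (path relative to the repository root is an assumption).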

Toppar: (toppar/) CHARMM Topology and Parameter files

Step 1: (consensus/) Generating consensus sequence hits library

  • script/gen_consensus_db.ipynb
    • Create database of consensus sequence hits from AlphaFold2 MSAs

Step 2: (model/) Generating AlphaFold2 Structures

  • script/run_alphafold_consensus.ipynb
    • Run example protein with AlphaFold2 using consensus sequence hits

Step 3: (cofactor/) Adding FAD Cofactor

  • script/fad.ipynb
    • Add FAD cofactor into generated example protein

Step 4: (dock/) Docking Array of Ligands

  • script/fftdock.ipynb
    • Use CHARMM Fast Fourier Transform Docking to get initial positions of ligand
  • script/prot_min.ipynb
    • Refine FFT poses in explicit protein representation
  • script/cluster.ipynb
    • Cluster poses to select representative poses

Step 5: (pred/) Prediction of Stereochemistry and Reactivity

  • script/stereo.ipynb
    • Predict stereochemistry from Boltzmann-weighted representative poses
  • script/reactivity.ipynb
    • Predict reactivity from pose features
  • script/vis_pred.ipynb
    • Visualize predicted poses
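
To illustrate what Boltzmann weighting of representative poses means, here is a minimal sketch. It assumes each pose has an energy in kcal/mol and a categorical stereochemistry label; the label strings, the temperature, and the function names are hypothetical, and the notebook's actual implementation may differ.

```python
import math
from typing import Dict, List, Sequence, Tuple

KB_KCAL_PER_MOL_K = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_weights(energies: Sequence[float], temperature: float = 298.15) -> List[float]:
    """Return normalized Boltzmann weights for a list of pose energies."""
    kt = KB_KCAL_PER_MOL_K * temperature
    e_min = min(energies)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / kt) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def label_probabilities(
    poses: Sequence[Tuple[float, str]], temperature: float = 298.15
) -> Dict[str, float]:
    """Sum Boltzmann weights per label for (energy, label) pose pairs."""
    weights = boltzmann_weights([energy for energy, _ in poses], temperature)
    probs: Dict[str, float] = {}
    for (_, label), weight in zip(poses, weights):
        probs[label] = probs.get(label, 0.0) + weight
    return probs
```

With this scheme, the predicted label is the one with the largest summed weight, so a single low-energy pose can outweigh several higher-energy poses of the opposite label.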

Step 6: (msa/) Generate Multiple Sequence Alignment Localized to Binding Site

  • script/gen_msa.ipynb
    • Generate Multiple Sequence Alignment
  • script/get_bs_ss_residues.ipynb
    • Get set of binding site and second shell residues
  • script/slice_msa.ipynb
    • Modify MSA to be limited to binding site and second shell residues
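
The slicing step can be sketched as follows: map residue numbers of a reference sequence to alignment columns, then keep only those columns in every aligned sequence. This is a minimal sketch, assuming 1-based residue numbering on an ungapped reference and '-' as the gap character; the function names are illustrative and not taken from the notebooks.

```python
from typing import Dict, Iterable


def residue_to_column(aligned_ref: str) -> Dict[int, int]:
    """Map 1-based residue numbers of the reference to 0-based MSA columns."""
    mapping: Dict[int, int] = {}
    residue_number = 0
    for column, char in enumerate(aligned_ref):
        if char != "-":  # gaps do not consume a residue number
            residue_number += 1
            mapping[residue_number] = column
    return mapping


def slice_msa(msa: Dict[str, str], ref_id: str, residues: Iterable[int]) -> Dict[str, str]:
    """Keep only the MSA columns aligned to the given reference residues."""
    mapping = residue_to_column(msa[ref_id])
    columns = [mapping[r] for r in sorted(set(residues))]
    return {name: "".join(seq[c] for c in columns) for name, seq in msa.items()}
```

Here `residues` would be the union of binding site and second shell residue numbers, expressed in the numbering of the sequence named by `ref_id`.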

Step 7: (seq_func/) Training Sequence-Function Model and SHAP Analysis

  • script/run_automl.ipynb
    • Fit multiple sequence alignment to predicted stereochemistry labels with gradient boosted trees and random forest models
  • script/shap_analysis.ipynb
    • Calculate SHAP values for residues and visualize how residues affect stereochemistry
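
Tree-based models need a numeric representation of the alignment. One common option is one-hot encoding of each MSA column, sketched below in plain Python. This is an assumption about a reasonable encoding, not a description of what run_automl.ipynb does; the feature naming scheme and function name are hypothetical.

```python
from typing import Dict, List, Tuple

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 standard amino acids plus gap


def one_hot_encode(msa: Dict[str, str]) -> Tuple[List[str], List[str], List[List[int]]]:
    """One-hot encode an MSA.

    Returns (sequence names, feature names, feature rows). Each alignment
    column contributes len(ALPHABET) binary features; characters outside
    ALPHABET leave all features for that column at zero.
    """
    names = list(msa)
    if not names:
        return [], [], []
    length = len(msa[names[0]])
    features = [f"col{c}_{aa}" for c in range(length) for aa in ALPHABET]
    rows: List[List[int]] = []
    for name in names:
        seq = msa[name]
        if len(seq) != length:
            raise ValueError(f"sequence {name!r} has a different aligned length")
        rows.append([1 if seq[c] == aa else 0 for c in range(length) for aa in ALPHABET])
    return names, features, rows
```

Keeping the feature names tied to column and residue identity makes per-feature attributions, such as SHAP values, straightforward to map back to individual alignment positions.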