Guide to running the sequence-structure-function pipeline

Refer to our paper in Bioinformatics for more details.

Setup

A collection of Jupyter notebooks demonstrating various aspects of the pipeline.
The Conda environments required to run the pipeline and the notebooks are defined in conda_yml.
- seq_struct_func.yml for steps 1,5-7
- alphafold2.yml for step 2
- build each environment with conda env create -f X.yml, where X.yml is one of the files above
- steps 3 and 4 require pyCHARMM and MMTSB to be installed in the seq_struct_func environment
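
As a minimal sketch of the build step above, the snippet below lists one `conda env create -f` command per .yml file in conda_yml. The helper name `conda_create_commands` is hypothetical and not part of this repository. The commands are only printed, not executed.

```python
from pathlib import Path


def conda_create_commands(yml_dir="conda_yml"):
    """Return one `conda env create -f <file>` command per .yml file found."""
    yml_files = sorted(Path(yml_dir).glob("*.yml"))
    return [["conda", "env", "create", "-f", str(path)] for path in yml_files]


if __name__ == "__main__":
    for cmd in conda_create_commands():
        # Printed only; to actually build, pass cmd to subprocess.run(cmd, check=True).
        print(" ".join(cmd))
```

pyCHARMM and MMTSB are not covered by this sketch and still need to be installed separately.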
Recommended resources: 1 GPU with 10 GB memory and 1-4 CPUs
Scripts are listed in the order they should be run.

asr_seq_annotations.xlsx
- All enzymes, sequences, and annotations from structure-function pipeline
extant_msa.fasta
- Multiple sequence alignment previously used for ancestral sequence reconstruction
fasta/
- Sequences from asr_seq_annotations.xlsx written in FASTA format
pdb_with_fad/
- Directory containing all AlphaFold2 models with FAD cofactor
top_dock_pose/
- Directory containing the lowest-energy poses from minimization in explicit protein
log_reg_models/
- Pretrained statsmodels logistic regression models
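
The FASTA files listed above (extant_msa.fasta and the files in fasta/) can be inspected with a few lines of Python. The parser below is a minimal standard-library sketch, not the reader used in the notebooks. The record names in the example are made up for illustration.

```python
def read_fasta(lines):
    """Parse FASTA text (an iterable of lines) into a {header: sequence} dict."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may wrap; collect the pieces and join them later.
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


# Hypothetical two-record example; gaps ("-") are kept as-is for aligned files.
example = """>enzyme_A
MKT-LLV
AGG
>enzyme_B
MKTALLV
"""
print(read_fasta(example.splitlines()))
```

To read a file on disk, pass an open file handle instead of the split string, for example `read_fasta(open("extant_msa.fasta"))`.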

script/gen_consensus_db.ipynb
- Create database of consensus sequence hits from AlphaFold2 MSAs

script/run_alphafold_consensus.ipynb
- Run example protein with AlphaFold2 using consensus sequence hits

script/fftdock.ipynb
- Use CHARMM Fast Fourier Transform Docking to get initial positions of ligand
script/prot_min.ipynb
- Refine FFT poses in explicit protein representation
script/cluster.ipynb
- Cluster poses to select representative poses
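
To illustrate the idea of picking representative poses, here is a minimal sketch of greedy RMSD clustering seeded by the lowest-energy pose. This is an assumption made for illustration: the algorithm, cutoff, and data layout used in cluster.ipynb are not described here and may differ. Poses are plain lists of (x, y, z) tuples with matching atom order.

```python
import math


def rmsd(pose_a, pose_b):
    """RMSD between two poses with matched atom order (no superposition)."""
    sq = sum((a - b) ** 2 for pa, pb in zip(pose_a, pose_b) for a, b in zip(pa, pb))
    return math.sqrt(sq / len(pose_a))


def greedy_cluster(poses, energies, cutoff=2.0):
    """Cluster poses around the lowest-energy unassigned pose.

    Returns a list of (representative_index, member_indices) tuples.
    The cutoff (in the same units as the coordinates) is an illustrative value.
    """
    unassigned = sorted(range(len(poses)), key=lambda i: energies[i])
    clusters = []
    while unassigned:
        seed = unassigned[0]  # lowest-energy pose not yet assigned
        members = [i for i in unassigned if rmsd(poses[seed], poses[i]) <= cutoff]
        clusters.append((seed, members))
        unassigned = [i for i in unassigned if i not in members]
    return clusters


# Hypothetical two-atom poses: the first two are near each other, the third is far away.
poses = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)],
    [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)],
]
energies = [-8.0, -9.5, -7.0]
print(greedy_cluster(poses, energies))
```

Seeding each cluster with the lowest-energy pose means the representative is also the best-scoring member, which is convenient when the representatives feed an energy-weighted analysis.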

script/stereo.ipynb
- Predict stereochemistry from Boltzmann-weighted representative poses
script/reactivity.ipynb
- Predict reactivity from pose features
script/vis_pred.ipynb
- Visualize predicted poses
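
Boltzmann weighting, as used for the stereochemistry prediction, gives each pose a weight proportional to exp(-E/kT), so low-energy poses dominate. The sketch below shows the calculation under stated assumptions: energies in kcal/mol, a temperature of 298.15 K, and hypothetical per-pose labels "R" and "S". The exact energies, temperature, and labeling used in stereo.ipynb may differ.

```python
import math

KB_KCAL_PER_MOL_K = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_weights(energies, temperature=298.15):
    """Return normalized Boltzmann weights for energies given in kcal/mol."""
    beta = 1.0 / (KB_KCAL_PER_MOL_K * temperature)
    e_min = min(energies)  # shift by the minimum for numerical stability
    factors = [math.exp(-beta * (e - e_min)) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def weighted_fraction(energies, labels, target, temperature=298.15):
    """Sum the Boltzmann weights of all poses carrying the target label."""
    weights = boltzmann_weights(energies, temperature)
    return sum(w for w, label in zip(weights, labels) if label == target)


# Hypothetical representative poses: energies and the stereochemistry each pose would give.
energies = [-9.5, -9.0, -7.0]
labels = ["R", "S", "R"]
print(weighted_fraction(energies, labels, "R"))
```

Subtracting the minimum energy before exponentiating leaves the normalized weights unchanged and avoids overflow for large negative energies.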

script/gen_msa.ipynb
- Generate a multiple sequence alignment (MSA)
script/get_bs_ss_residues.ipynb
- Get set of binding site and second shell residues
script/slice_msa.ipynb
- Modify MSA to be limited to binding site and second shell residues
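
Limiting an MSA to selected residues requires mapping residue numbers of a reference sequence onto alignment columns, because gaps shift the positions. The snippet below is a minimal sketch of that mapping and slicing. It assumes the MSA is a dict of equal-length aligned strings and that residues are numbered from 1 along the ungapped reference; the function names and the example sequences are hypothetical.

```python
def residue_to_column_map(aligned_ref):
    """Map 1-based residue numbers of the reference to 0-based alignment columns."""
    mapping = {}
    residue = 0
    for col, char in enumerate(aligned_ref):
        if char != "-":  # gaps do not consume a residue number
            residue += 1
            mapping[residue] = col
    return mapping


def slice_msa(msa, ref_name, residues):
    """Keep only the alignment columns that correspond to the given reference residues."""
    mapping = residue_to_column_map(msa[ref_name])
    columns = [mapping[r] for r in sorted(residues)]
    return {name: "".join(seq[c] for c in columns) for name, seq in msa.items()}


# Hypothetical alignment: residue 3 of "ref" sits in column 3 because of the gap.
msa = {
    "ref": "MK-TAL",
    "seq_b": "MRGTA-",
}
print(slice_msa(msa, "ref", {2, 3, 5}))
```

In this setup, the residue set would be the binding site and second shell residues produced by the previous notebook.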

script/run_automl.ipynb
- Fit gradient boosted tree and random forest models to the predicted stereochemistry labels, using the multiple sequence alignment as input
script/shap_analysis.ipynb
- Calculate SHAP values for residues and visualize how residues affect stereochemistry
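
Tree-based models need numeric input, so aligned sequences are typically converted to a feature matrix first. One common choice is one-hot encoding, sketched below with the standard library only. This is an assumption about the general approach, not a description of the encoding used in run_automl.ipynb. The feature naming scheme is hypothetical.

```python
def one_hot_encode_msa(sequences, alphabet="ACDEFGHIKLMNPQRSTVWY-"):
    """One-hot encode equal-length aligned sequences.

    Returns (feature_names, rows), where each row is a flat list of 0/1 values
    with one block of len(alphabet) entries per alignment column.
    Characters outside the alphabet (for example "X") encode as all zeros.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    feature_names = [f"col{c}_{aa}" for c in range(length) for aa in alphabet]
    rows = []
    for seq in sequences:
        row = []
        for char in seq:
            row.extend(1 if char == aa else 0 for aa in alphabet)
        rows.append(row)
    return feature_names, rows


# Hypothetical three-column alignment slice.
names, rows = one_hot_encode_msa(["KTL", "RT-"])
print(len(names), [sum(row) for row in rows])
```

Keeping a name for every feature makes it possible to trace a per-feature importance or SHAP value back to a specific alignment column and residue type.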