Multi-state modeling of protein structures using AlphaFold

Building state-annotated HHsuite databases

All the required scripts and examples are in build_state_annotated_databases

  1. Getting activation state annotations for available experimental GPCR structures
    The list of GPCR structures with activation state annotations is available from GPCRdb; see also the activation state definition.
  2. Preparing input files for building state-annotated HHsuite databases
    The script takes a list of PDB IDs for a single state: active, inactive, or intermediate. For example, GPCR.Active, GPCR.Inactive, and GPCR.Intermediate are the lists of active, inactive, and intermediate state GPCRs used in this study. In addition, a list of PDB IDs with their preferred chains is required, so that one chain can be selected when a PDB file contains several. An example is provided with the scripts.
  3. Running the script
    The script follows the official guideline for building customized HHsuite databases. Running it requires HHsuite and the UniClust30 database. The path of the UniClust30 database in build_db.sh must also be adjusted to match the local installation.
    Example command:
./build_db.sh GPCR.${state}
  4. Expected outputs
    The output of the script is a set of HHsuite database files for a GPCR state:
GPCR100.${state}_a3m.ff{data,index}
GPCR100.${state}_hhm.ff{data,index}
GPCR100.${state}_cs219.ff{data,index}
  5. Pre-built GPCR databases
    State-annotated GPCR databases can be obtained from our repositories on Zenodo or Google Drive.
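After building or downloading a database, it can help to confirm that all six files listed under "Expected outputs" are present. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes that the state label passed in (for example "Active") is the same string that replaces ${state} in the file names.

```python
from pathlib import Path

# File name pieces taken from the "Expected outputs" listing above.
SUFFIXES = ("a3m", "hhm", "cs219")
EXTENSIONS = ("ffdata", "ffindex")


def expected_db_files(state, prefix="GPCR100"):
    """Return the six file names expected for one state (hypothetical helper)."""
    return [
        f"{prefix}.{state}_{suffix}.{ext}"
        for suffix in SUFFIXES
        for ext in EXTENSIONS
    ]


def missing_db_files(db_dir, state, prefix="GPCR100"):
    """Return the expected file names that are not present in db_dir."""
    db_dir = Path(db_dir)
    return [
        name
        for name in expected_db_files(state, prefix)
        if not (db_dir / name).is_file()
    ]


if __name__ == "__main__":
    # Assumption: the databases were written to the current directory.
    for state in ("Active", "Inactive"):
        missing = missing_db_files(".", state)
        print(state, "complete" if not missing else f"missing: {missing}")
```

An empty list from missing_db_files means the database for that state is complete as far as file names go; it does not validate the file contents.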

GPCR structure prediction using AlphaFold

The structure prediction scripts rely on AlphaFold, which we slightly modified to conduct ablation studies and to model GPCR structures in a specific activation state. Follow the AlphaFold setup procedure to download the genetic databases and model parameters. Unlike the original AlphaFold, our scripts are based on a non-Docker version and run in an Anaconda environment. To create this environment, refer to the relevant issue page of the AlphaFold repository or execute the commands described in our script.

  1. Prerequisites
  • AlphaFold package
  • Anaconda environment for AlphaFold
  • Activation state annotated GPCR100 databases
  2. Update libconfig_alphafold.py
    The following entries need to be updated:
  • Paths for executables: jackhmmer, hhblits, hhsearch, kalign
  • Paths for genetic databases: DOWNLOAD_DIR, {uniref90, mgnify, bfd, small_bfd, uniclust30, pdb70}_database_path, template_mmcif_dir, obsolete_pdbs_path
  • Paths for activation state annotated GPCR100 databases: gpcr100_active_db_path, gpcr100_inactive_db_path
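Misconfigured paths are a common cause of failed runs, so a quick check before starting a prediction can save time. The sketch below is an illustration only and is not taken from libconfig_alphafold.py: the function name and the dictionary layout are hypothetical, and the example paths are placeholders. It assumes that HHsuite-style database paths may be given as file name prefixes rather than as existing files, so a prefix counts as valid if any file starts with it.

```python
import glob
import os
import shutil


def find_path_problems(executables, database_paths):
    """Return a list of problem descriptions; an empty list means no problem found."""
    problems = []
    for name, path in executables.items():
        # shutil.which accepts a bare command name or a full path.
        if shutil.which(path) is None:
            problems.append(f"{name}: no executable found at {path!r}")
    for name, path in database_paths.items():
        # Accept an existing file or directory, or a prefix shared by
        # several files (assumed convention for HHsuite databases).
        is_prefix = bool(glob.glob(glob.escape(path) + "*"))
        if not os.path.exists(path) and not is_prefix:
            problems.append(f"{name}: nothing found at {path!r}")
    return problems


if __name__ == "__main__":
    # Placeholder values; replace them with the ones set in your configuration.
    executables = {"hhblits": "hhblits", "kalign": "kalign"}
    database_paths = {
        "gpcr100_active_db_path": "/path/to/GPCR100.Active",
        "gpcr100_inactive_db_path": "/path/to/GPCR100.Inactive",
    }
    for problem in find_path_problems(executables, database_paths):
        print(problem)
```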
  3. GPCR structure predictions
    The commands below assume an activated Anaconda environment that has all the libraries and packages required for running AlphaFold.
  • Modeling GPCRs in a specific activation state (this study)
./structure_prediction/run.py ${FASTA_FILE} --preset study --state active    # for modeling in active state
./structure_prediction/run.py ${FASTA_FILE} --preset study --state inactive  # for modeling in inactive state
  • The original AlphaFold protocol
./structure_prediction/run.py ${FASTA_FILE} --preset original
  • Other protocols for the ablation study as described in the paper
# running the original AlphaFold protocol but using activation state-annotated GPCR databases
./structure_prediction/run.py ${FASTA_FILE} --preset original --state active     # for modeling in active state
./structure_prediction/run.py ${FASTA_FILE} --preset original --state inactive   # for modeling in inactive state

# running AlphaFold using sequence and MSA-based features, without structure templates-based features
./structure_prediction/run.py ${FASTA_FILE} --preset no_templ

# running AlphaFold using sequence-based features only, without MSA and structure templates-based features
./structure_prediction/run.py ${FASTA_FILE} --preset seqonly

# running MODELLER
./structure_prediction/run.py ${FASTA_FILE} --preset tbm
  • Sampling of intermediate conformations
./structure_prediction/interpolate.py --fasta_path=${FASTA_FILE} \
                                      --pdb_init=${INACTIVE_MODEL},${ACTIVE_MODEL} \
                                      --unk_pdb=True \
                                      --interpolate_region=${TM_RESIDUES}

Both the active and inactive state models must be generated before they are passed to the script. The "interpolate_region" option is not required, but it may improve structure comparison between states. An example value is "19-51,56-87,92-127,136-167,183-223,376-413,418-443".
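The order of operations described above (two state-specific predictions, then interpolation) can be sketched as a small command builder. This is a hedged illustration, not code from the repository: the helper names are hypothetical, the range validation reflects an assumed "start-end" format inferred from the example value, and the script's own parser may accept or reject other inputs. Only the script paths and flags shown in the commands above are used.

```python
import re

PRESETS = ("study", "original", "no_templ", "seqonly", "tbm")
STATES = ("active", "inactive")


def run_command(fasta, preset, state=None):
    """Build the argument list for run.py (hypothetical helper)."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    argv = ["./structure_prediction/run.py", fasta, "--preset", preset]
    if state is not None:
        if state not in STATES:
            raise ValueError(f"unknown state: {state!r}")
        argv += ["--state", state]
    return argv


def parse_regions(spec):
    """Parse a value such as "19-51,56-87" into (start, end) residue ranges."""
    regions = []
    for part in spec.split(","):
        match = re.fullmatch(r"\s*(\d+)-(\d+)\s*", part)
        if match is None:
            raise ValueError(f"malformed range: {part!r}")
        start, end = int(match.group(1)), int(match.group(2))
        if start > end:
            raise ValueError(f"range start exceeds end: {part!r}")
        regions.append((start, end))
    return regions


def interpolate_command(fasta, inactive_model, active_model, regions=None):
    """Build the argument list for interpolate.py (hypothetical helper)."""
    argv = [
        "./structure_prediction/interpolate.py",
        f"--fasta_path={fasta}",
        f"--pdb_init={inactive_model},{active_model}",
        "--unk_pdb=True",
    ]
    if regions:
        parse_regions(regions)  # validate only; the string is passed unchanged
        argv.append(f"--interpolate_region={regions}")
    return argv


if __name__ == "__main__":
    # Placeholder file names; the model paths depend on where run.py writes output.
    print(run_command("target.fa", "study", "inactive"))
    print(run_command("target.fa", "study", "active"))
    print(interpolate_command("target.fa", "inactive.pdb", "active.pdb", "19-51,56-87"))
```

Each returned list can be passed to subprocess.run once the two state models exist. Building the lists separately makes it easy to inspect the commands before anything is executed.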

Running the protocol on Colab

A slightly modified protocol using the ColabFold pipeline is implemented on Colab. The main difference is the MSA generation step: the ColabFold-based protocol uses MMseqs2 for homologous sequence searches.

GPCR models in the active and inactive states

We have modeled non-olfactory human GPCRs in the active and inactive states using our multi-state modeling protocol. The models are available via Zenodo or Google Drive.

References

[1] Heo, L. and Feig, M., Multi-State Modeling of G-protein Coupled Receptors at Experimental Accuracy, bioRxiv (2021).
[2] Jumper, J. et al., Highly accurate protein structure prediction with AlphaFold, Nature (2021), 596, 583-589.
[3] Mirdita, M. et al., ColabFold - Making protein folding accessible to all, bioRxiv (2021), 10.1101/2021.08.15.456425.