EnzyGen

Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates

Model Architecture

This repository contains the code, data, and model weights for the ICML 2024 paper "Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates".

The overall model architecture is shown below:

[Figure: EnzyGen model architecture]

Environment

The dependencies can be set up using the following commands:
conda create -n enzygen python=3.8 -y 
conda activate enzygen 
conda install pytorch=1.10.2 cudatoolkit=11.3 -c pytorch -y 
bash setup.sh 
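
After running the commands above, a quick check can confirm that the interpreter and PyTorch match the pinned versions. The sketch below is not part of the repository: the helper names `installed_version` and `check_pins` are hypothetical, and the check only compares version prefixes.

```python
import sys
from importlib import metadata  # part of the standard library from Python 3.8

# Version pinned by the conda command above.
PINS = {"torch": "1.10.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, get_version=installed_version):
    """Map each package to (expected, found, ok); ok means found starts with expected."""
    report = {}
    for package, expected in pins.items():
        found = get_version(package)
        ok = found is not None and found.startswith(expected)
        report[package] = (expected, found, ok)
    return report


print("python", ".".join(map(str, sys.version_info[:3])), "(expected 3.8.x)")
for package, (expected, found, ok) in check_pins(PINS).items():
    print(package, "expected", expected, "found", found, "OK" if ok else "MISMATCH")
```

The prefix comparison is deliberate, since some builds append a local suffix such as `+cu113` to the version string.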

Download Data

We provide the EnzyBench dataset at EnzyBench and a dictionary mapping Enzyme Classification Tree (EC) IDs to indices at EC_Dict.

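Please download both files and put them in the data folder. The links below are Google Drive share links, which wget saves as an HTML viewer page rather than the file itself, so the commands use gdown instead.

mkdir data
cd data
pip install gdown
gdown --fuzzy "https://drive.google.com/file/d/1VycT_gFV2JBpRMCBZlwwxLLRcZDljXCS/view?usp=drive_link"
gdown --fuzzy "https://drive.google.com/file/d/1BCitsFRQpzUbGss7xBpTpvKcMcJh_oOz/view?usp=drive_link"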
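
For scripted downloads, the file ID embedded in a Drive share link is the part that download tools need. The sketch below is not part of the repository. The helper names are hypothetical, and the `uc?export=download&id=` URL pattern is a commonly used convention rather than something this repository documents; large files may still require a confirmation step.

```python
import re

# Matches the ID segment in links such as https://drive.google.com/file/d/<id>/view
_DRIVE_ID = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def drive_file_id(share_url):
    """Extract the file ID from a Google Drive share link."""
    match = _DRIVE_ID.search(share_url)
    if match is None:
        raise ValueError(f"no Drive file ID found in {share_url!r}")
    return match.group(1)


def direct_download_url(share_url):
    """Build a direct-download URL (assumed pattern) from a share link."""
    return "https://drive.google.com/uc?export=download&id=" + drive_file_id(share_url)


links = [
    "https://drive.google.com/file/d/1VycT_gFV2JBpRMCBZlwwxLLRcZDljXCS/view?usp=drive_link",
    "https://drive.google.com/file/d/1BCitsFRQpzUbGss7xBpTpvKcMcJh_oOz/view?usp=drive_link",
]
for link in links:
    print(direct_download_url(link))
```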

Download Model

We provide the checkpoint used in the paper at Model

Please download the checkpoints and put them in the models folder.

To train your own model, follow the training instructions below.

Training

To train a model with the enzyme-substrate interaction constraint introduced in our paper, run the script below:
bash train_enzyme_substrate_33layer.sh

To train a model without the enzyme-substrate interaction constraint, run the script below:

bash train_cluster_enzyme_33layer.sh

In our experience, the best performance comes from first training a model without the enzyme-substrate interaction constraint for around 200,000 steps, and then continuing training with the sequence recovery loss, coordinate recovery loss, and enzyme-substrate interaction loss.
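
The two-stage recipe can be sketched as a step-dependent choice of loss terms. This is an illustration, not the repository's training code: the function names, the equal default loss weights, and the assumption that stage one uses only the sequence and coordinate recovery losses are all hypothetical.

```python
# Step at which the interaction loss is switched on ("around 200,000 steps" in the text).
STAGE_ONE_STEPS = 200_000


def active_losses(step, stage_one_steps=STAGE_ONE_STEPS):
    """Return the names of the loss terms used at a given training step."""
    losses = ["sequence_recovery", "coordinate_recovery"]
    if step >= stage_one_steps:
        # Stage two: continue training with the enzyme-substrate interaction loss added.
        losses.append("enzyme_substrate_interaction")
    return losses


def total_loss(step, loss_values, weights=None):
    """Sum the active loss terms; weights default to 1.0 (an assumption)."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * loss_values[name] for name in active_losses(step))


# Made-up loss values, used only to show which terms contribute at each stage.
example = {
    "sequence_recovery": 2.0,
    "coordinate_recovery": 0.5,
    "enzyme_substrate_interaction": 0.25,
}
print(total_loss(0, example))        # stage one: interaction term ignored
print(total_loss(200_000, example))  # stage two: all three terms
```

In the repository itself the two stages correspond to separate training scripts, so the switch would in practice happen by resuming from the stage-one checkpoint rather than inside a single loop.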

Inference

To design enzymes for the 30 third-level test categories, run the following script:
bash generation.sh

The output directory contains five items:

  1. protein.txt: the designed protein sequences
  2. src.seq.txt: the ground-truth sequences
  3. pdb.txt: the target PDB IDs and the corresponding chains
  4. pred_pdbs: the directory of designed PDB files
  5. tgt_pdbs: the directory of target PDB files
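
As a sketch of how these outputs might be consumed, the snippet below pairs each designed sequence with its ground-truth sequence and PDB entry, and defines a simple per-position identity. It assumes each text file holds one entry per line in matching order, which is not stated here; `read_lines`, `identity`, and `load_outputs` are hypothetical helpers, not repository functions.

```python
from pathlib import Path


def read_lines(path):
    """Read non-empty lines from a text file, stripped of surrounding whitespace."""
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]


def identity(designed, reference):
    """Fraction of positions with the same residue, relative to the longer sequence.

    Sequences are compared position by position; no alignment is performed.
    """
    longest = max(len(designed), len(reference))
    if longest == 0:
        return 0.0
    matches = sum(a == b for a, b in zip(designed, reference))
    return matches / longest


def load_outputs(output_dir):
    """Return a list of (pdb_entry, designed, reference) triples from an output directory."""
    out = Path(output_dir)
    designed = read_lines(out / "protein.txt")
    reference = read_lines(out / "src.seq.txt")
    entries = read_lines(out / "pdb.txt")
    if not (len(designed) == len(reference) == len(entries)):
        raise ValueError("output files have different numbers of entries")
    return list(zip(entries, designed, reference))
```

With a real output directory, `load_outputs(path)` followed by `identity(designed, reference)` on each triple gives a rough recovery figure per target.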

Evaluation

We provide the ESP evaluation data at [ESP_data_eval](https://drive.google.com/file/d/1q8NENdVWBufz5fDk7TviS6h6_BKmfviN/view?usp=drive_link)

The format for ESP evaluation is (Protein_Sequence Substrate_Representation) for each test case.
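
A minimal sketch of producing one line in this format follows. The single-space separator, the use of a SMILES string as the substrate representation, and the example sequence are assumptions made for illustration; check the ESP evaluation code for the representation it actually expects. `format_esp_line` is a hypothetical helper.

```python
def format_esp_line(protein_sequence, substrate_representation):
    """Join one test case as 'Protein_Sequence Substrate_Representation'."""
    sequence = "".join(protein_sequence.split())  # drop any internal whitespace
    substrate = substrate_representation.strip()
    if not sequence or not substrate:
        raise ValueError("both sequence and substrate must be non-empty")
    if any(ch.isspace() for ch in substrate):
        raise ValueError("substrate representation must not contain whitespace")
    return f"{sequence} {substrate}"


# Hypothetical test case: a short made-up sequence and ethanol written as SMILES.
print(format_esp_line("MKT AYI", "CCO"))
```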

The evaluation code for the ESP score was developed by Alexander Kroll and can be found at link

Expected Results

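| Protein Family | 1.1.1 | 1.11.1 | 1.14.13 | 1.14.14 | 1.2.1 | 2.1.1 | 2.3.1 | 2.4.1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EnzyGen | 0.64 | 0.98 | 0.38 | 0.42 | 0.72 | 0.80 | 0.61 | 0.38 |

| Protein Family | 2.4.2 | 2.5.1 | 2.6.1 | 2.7.1 | 2.7.10 | 2.7.11 | 2.7.4 | 2.7.7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EnzyGen | 0.86 | 0.66 | 0.53 | 0.76 | 0.92 | 0.93 | 0.80 | 0.79 |

| Protein Family | 3.1.1 | 3.1.3 | 3.1.4 | 3.2.2 | 3.4.19 | 3.4.21 | 3.5.1 | 3.5.2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EnzyGen | 0.76 | 0.62 | 0.88 | 0.47 | 0.26 | 0.73 | 0.40 | 0.14 |

| Protein Family | 3.6.1 | 3.6.1 | 3.6.5 | 4.1.1 | 4.2.1 | 4.6.1 | -- | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EnzyGen | 0.66 | 0.78 | 0.40 | 0.80 | 0.93 | 0.57 | -- | 0.65 |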
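
The reported average can be reproduced from the per-family scores above. The snippet re-enters the table values as listed (including the family label 3.6.1, which appears twice in the table) and takes their unweighted mean; treating the average as an unweighted mean over the 30 entries is an assumption.

```python
# (family, score) pairs copied from the Expected Results table, in order.
scores = [
    ("1.1.1", 0.64), ("1.11.1", 0.98), ("1.14.13", 0.38), ("1.14.14", 0.42),
    ("1.2.1", 0.72), ("2.1.1", 0.80), ("2.3.1", 0.61), ("2.4.1", 0.38),
    ("2.4.2", 0.86), ("2.5.1", 0.66), ("2.6.1", 0.53), ("2.7.1", 0.76),
    ("2.7.10", 0.92), ("2.7.11", 0.93), ("2.7.4", 0.80), ("2.7.7", 0.79),
    ("3.1.1", 0.76), ("3.1.3", 0.62), ("3.1.4", 0.88), ("3.2.2", 0.47),
    ("3.4.19", 0.26), ("3.4.21", 0.73), ("3.5.1", 0.40), ("3.5.2", 0.14),
    ("3.6.1", 0.66), ("3.6.1", 0.78), ("3.6.5", 0.40), ("4.1.1", 0.80),
    ("4.2.1", 0.93), ("4.6.1", 0.57),
]

# Unweighted mean over all listed entries.
average = sum(score for _, score in scores) / len(scores)
print(len(scores), round(average, 2))
```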

Citation

If you find our work helpful, please consider citing our paper.
@inproceedings{songgenerative,
  title={Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates},
  author={Song, Zhenqiao and Zhao, Yunlong and Shi, Wenxian and Jin, Wengong and Yang, Yang and Li, Lei},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024}
}