TamGen: Target-aware Molecule Generation for Drug Design Using a Chemical Language Model
This is the implementation of the paper TamGen: Target-aware Molecule Generation for Drug Design Using a Chemical Language Model.
Our implementation is built on fairseq-v0.8.0
conda create -n TamGen python=3.9
conda activate TamGen
bash setup_env.sh
Please refer to the README in the folder data
You can build your customized dataset through the following methods:
- Build a customized dataset from PDB IDs, using the center coordinates of the binding site of each PDB entry.
  python scripts/build_data/prepare_pdb_ids_center.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} -t ${THRESHOLD}
  - PDB_ID_LIST: A CSV file with the columns pdb_id,center_x,center_y,center_z,[uniprot_id]. The [uniprot_id] column is optional.
  - DATASET_NAME: A name you choose yourself. The simplest choice is test.
  - OUTPUT_PATH: The output path of the processed data.
  - THRESHOLD: The radius of the pocket region centered at center_x,center_y,center_z.
- Build a customized dataset from PDB IDs. The script automatically finds the binding sites according to the ligands in the structure file.
  python scripts/build_data/prepare_pdb_ids.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} -t ${THRESHOLD}
  - PDB_ID_LIST: A CSV file with the columns pdb_id,[ligand_inchi,uniprot_id], where [] means optional.
  - THRESHOLD: A residue $r$ is considered part of the pocket region if any atom in $r$ lies within THRESHOLD angstroms of a ligand atom. For a given pdb_id, its associated ligands can be found in database/PdbCCD.
  - The remaining parameters are the same as those in method 1.
- Build a customized dataset from PDB IDs, using the center coordinates of the binding site of each PDB entry, and add the provided scaffold to each center.
  python scripts/build_data/prepare_pdb_ids_center_scaffold.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} -t ${THRESHOLD} --scaffold-file ${SCAFFOLD_FILE}
  - SCAFFOLD_FILE: Contains molecular scaffolds that will be incorporated into the processed database. These scaffolds serve as structural templates for subsequent conditional generation of new molecules.
  - The remaining parameters are the same as those in method 1.

For customized PDB structure files, put the files in the --pdb-path folder and list their filenames in the pdb_id column of the PDB_ID_LIST CSV file. We provide an example of how to build and use customized data in customized_example.
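As an illustration of the PDB_ID_LIST layout described for the first method, the following minimal Python sketch writes a CSV with the columns pdb_id,center_x,center_y,center_z. The PDB IDs and coordinates are placeholders, the function name `write_pdb_id_list` is hypothetical, and the presence of a header row is an assumption; check the README in the folder data or the provided customized_example for the exact format the scripts expect.

```python
import csv
import io

# Hypothetical entries: (pdb_id, center_x, center_y, center_z).
# The IDs and coordinates are placeholders, not real binding-site centers.
ENTRIES = [
    ("1abc", 10.5, -3.2, 27.0),
    ("2xyz", 0.0, 14.75, -8.1),
]

COLUMNS = ["pdb_id", "center_x", "center_y", "center_z"]


def write_pdb_id_list(entries, handle):
    """Write rows in the pdb_id,center_x,center_y,center_z layout.

    Assumption: the file starts with a header row naming the columns.
    The optional uniprot_id column is omitted here.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(COLUMNS)
    for pdb_id, x, y, z in entries:
        writer.writerow([pdb_id, x, y, z])


# Write to an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
write_pdb_id_list(ENTRIES, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce a file on disk, pass a handle opened with `open(path, "w", newline="")` instead of the in-memory buffer, then give that path to the script as PDB_ID_LIST.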
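The THRESHOLD rule in the second method (a residue belongs to the pocket if any of its atoms lies within THRESHOLD angstroms of a ligand atom) can be sketched in plain Python as follows. This is a brute-force illustration on toy coordinates, not the code used by prepare_pdb_ids.py; the function name `pocket_residues`, the input layout, and the choice to treat the boundary as inclusive are all assumptions.

```python
import math


def pocket_residues(residues, ligand_atoms, threshold):
    """Return IDs of residues with at least one atom within `threshold`
    angstroms of any ligand atom.

    residues: dict mapping a residue ID to a list of (x, y, z) atom coordinates.
    ligand_atoms: list of (x, y, z) ligand atom coordinates.
    Assumption: a distance exactly equal to the threshold counts as inside.
    """
    selected = []
    for residue_id, atoms in residues.items():
        if any(
            math.dist(atom, ligand_atom) <= threshold
            for atom in atoms
            for ligand_atom in ligand_atoms
        ):
            selected.append(residue_id)
    return selected


# Toy coordinates for illustration only (not taken from a real structure).
residues = {
    "ALA1": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "GLY2": [(20.0, 0.0, 0.0)],
    "SER3": [(0.0, 12.0, 0.0), (0.0, 9.0, 0.0)],
}
ligand_atoms = [(3.0, 0.0, 0.0), (0.0, 2.0, 0.0)]

pocket = pocket_residues(residues, ligand_atoms, threshold=8.0)
print(pocket)  # ['ALA1', 'SER3']
```

GLY2 is excluded because its only atom is more than 8 angstroms from both ligand atoms, while SER3 is included because one of its atoms is 7 angstroms from a ligand atom.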
The checkpoint can be found at https://doi.org/10.5281/zenodo.13751391. Please download checkpoints.zip and gpt_model.zip and uncompress them. After that, you will get two folders: checkpoints and gpt_model. Please place them under the folder TamGen/. The structures of the two folders are shown below:
checkpoints/
├── README.MD
├── crossdock_pdb_A10
│   └── checkpoint_best.pt
└── crossdocked_model
    └── checkpoint_best.pt
gpt_model/
├── checkpoint_best.pt
└── dict.txt
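Before training or inference, it can help to confirm that the uncompressed folders match the layout above. The following is a minimal sketch using only the standard library; the helper name `missing_files` is hypothetical, and it assumes the script is run from the TamGen/ folder. README.MD is not checked because it is documentation only.

```python
from pathlib import Path

# Files listed in the folder structure above, relative to the TamGen/ folder.
EXPECTED_FILES = [
    "checkpoints/crossdock_pdb_A10/checkpoint_best.pt",
    "checkpoints/crossdocked_model/checkpoint_best.pt",
    "gpt_model/checkpoint_best.pt",
    "gpt_model/dict.txt",
]


def missing_files(root):
    """Return the expected files that are not present under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Assumption: the current working directory is the TamGen/ folder.
missing = missing_files(".")
if missing:
    print("Missing files:")
    for rel in missing:
        print("  " + rel)
else:
    print("All expected checkpoint files are in place.")
```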
# train a new model
bash scripts/train.sh -D ${DATA_PATH} --savedir ${SAVED_MODEL_PATH}
For example, one can run bash scripts/train.sh -D data/crossdocked/bin/ --savedir crossdock_train --fp16 to train models.
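When training is launched from a Python workflow, the same command can be assembled programmatically. The sketch below uses only the options that appear in the example above (-D, --savedir, --fp16); the helper name `build_train_command` is hypothetical, and the sketch only prints the command rather than starting a training run.

```python
import shlex


def build_train_command(data_path, save_dir, fp16=False):
    """Assemble the argument list for scripts/train.sh.

    Only the options shown in the example above are used.
    """
    command = ["bash", "scripts/train.sh", "-D", data_path, "--savedir", save_dir]
    if fp16:
        command.append("--fp16")
    return command


command = build_train_command("data/crossdocked/bin/", "crossdock_train", fp16=True)
print(shlex.join(command))

# To actually start training from the repository root, one option is:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting problems if a data or output path contains spaces.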
One can refer to scripts/generate.sh for running inference code.
We also provide an example, which can be run with bash scripts/example_inference.sh
We provide a demo at interactive_decode.ipynb. Its first cell is shown here, and the notes after the code list what to set up before running it:
from TamGen_Demo import TamGenDemo, prepare_pdb_data
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
worker = TamGenDemo(
    data="./TamGen_Demo_Data",
    ckpt="checkpoints/crossdock_pdb_A10/checkpoint_best.pt"
)
- Specify the GPU ID.
- Download the checkpoint and place it at "checkpoints/crossdock_pdb_A10/checkpoint_best.pt" or at your specified location.
- Download the pre-trained GPT model and put it into the folder gpt_model.
Please cite this paper if you use the code or find TamGen helpful for your work:
@Article{Wu2024TamGen,
author={Wu, Kehan and Xia, Yingce and Deng, Pan and Liu, Renhe
and Zhang, Yuan and Guo, Han and Cui, Yumeng and Pei, Qizhi and Wu, Lijun
and Xie, Shufang and Chen, Si and Lu, Xi and Hu, Song and Wu, Jinzhi
and Chan, Chi-Kin and Chen, Shawn and Zhou, Liangliang and Yu, Nenghai and Chen, Enhong
and Liu, Haiguang and Guo, Jinjiang and Qin, Tao and Liu, Tie-Yan},
title={TamGen: drug design with target-aware molecule generation through a chemical language model},
journal={Nature Communications},
year={2024},
month={Oct},
day={29},
volume={15},
number={1},
pages={9360},
issn={2041-1723},
doi={10.1038/s41467-024-53632-4},
url={https://doi.org/10.1038/s41467-024-53632-4}
}
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.