PocketFlow

An autoregressive flow model incorporating chemical knowledge for generating drug-like molecules inside protein pockets

PocketFlow is a data- and knowledge-driven, structure-based molecular generative model.

Deep learning-based molecular generation has extensive applications in many fields, particularly drug discovery. However, the majority of current deep generative models (DGMs) are ligand-based and do not consider chemical knowledge in the molecular generation process, often resulting in a relatively low success rate. We herein propose a structure-based molecular generative framework with chemical knowledge explicitly considered (named PocketFlow), which generates novel ligand molecules inside protein binding pockets. In various computational evaluations, PocketFlow showed state-of-the-art performance, with generated molecules being 100% chemically valid and highly drug-like. Ablation experiments demonstrate the critical role of chemical knowledge in ensuring the validity and drug-likeness of the generated molecules. We applied PocketFlow to two new target proteins related to epigenetic regulation, HAT1 and YTHDC1, and successfully obtained wet-lab validated bioactive compounds. The binding modes of the active compounds with the target proteins are close to those predicted by molecular docking and were further confirmed by the X-ray crystal structure. All the results suggest that PocketFlow is a useful deep generative model, capable of generating innovative bioactive molecules from scratch given a protein binding pocket.

Requirements:

  • Python 3.8
  • PyTorch 1.12
  • PyTorch Geometric 2.1.0
  • RDKit
  • Open Babel
  • PyMOL

Molecular generation

Molecules can be generated by running the following command. The pocket PDB file (-pkt) and the model parameter file (--ckpt) are required; all other parameters are optional:

python main_generate.py -pkt test_samples/test_pocket10/1bvr_C_rec_pocket10-surf.pdb --ckpt ckpt/ZINC-pretrained-255000.pt -n 100 -d cuda:0 --root_path gen_results --name 1bvr -at 1.0 -bt 1.0 --max_atom_num 35 -ft 0.5 -cm True --with_print True
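When generation has to be launched for several pockets, the same command can be assembled from Python. The following is a minimal sketch that only builds the argument list from flags shown in the command above; the helper name build_generate_cmd and its default values are hypothetical and not part of PocketFlow.

```python
import shlex


def build_generate_cmd(pocket, ckpt, num_gen=100, device="cuda:0",
                       root_path="gen_results", name="1bvr"):
    """Assemble the main_generate.py command line (hypothetical helper)."""
    return [
        "python", "main_generate.py",
        "-pkt", pocket,            # required: pocket PDB file
        "--ckpt", ckpt,            # required: model parameter file
        "-n", str(num_gen),        # number of molecules to generate
        "-d", device,              # cuda:x or cpu
        "--root_path", root_path,  # where results are saved
        "--name", name,            # receptor name
    ]


cmd = build_generate_cmd(
    "test_samples/test_pocket10/1bvr_C_rec_pocket10-surf.pdb",
    "ckpt/ZINC-pretrained-255000.pt",
)
# Print the command as a single shell-quoted string for inspection.
print(shlex.join(cmd))
```

To actually start a run, the list can be passed to subprocess.run(cmd, check=True) from the repository root.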

All parameters of generation:

usage: main_generate.py [-h] [-pkt POCKET] [--ckpt CKPT] [-n NUM_GEN] [--name NAME] [-d DEVICE] [-at ATOM_TEMPERATURE] [-bt BOND_TEMPERATURE] [--max_atom_num MAX_ATOM_NUM] [-ft FOCUS_THRESHOLD] [-cm CHOOSE_MAX]
                        [--min_dist_inter_mol MIN_DIST_INTER_MOL] [--bond_length_range BOND_LENGTH_RANGE] [-mdb MAX_DOUBLE_IN_6RING] [--with_print WITH_PRINT] [--root_path ROOT_PATH] [--readme README]

optional arguments:
  -h, --help            show this help message and exit
  -pkt POCKET, --pocket POCKET
                        the pdb file of pocket in receptor
  --ckpt CKPT           the path of saved model
  -n NUM_GEN, --num_gen NUM_GEN
                        the number of generateive molecule
  --name NAME           receptor name
  -d DEVICE, --device DEVICE
                        cuda:x or cpu
  -at ATOM_TEMPERATURE, --atom_temperature ATOM_TEMPERATURE
                        temperature for atom sampling
  -bt BOND_TEMPERATURE, --bond_temperature BOND_TEMPERATURE
                        temperature for bond sampling
  --max_atom_num MAX_ATOM_NUM
                        the max atom number for generation
  -ft FOCUS_THRESHOLD, --focus_threshold FOCUS_THRESHOLD
                        the threshold of probility for focus atom
  -cm CHOOSE_MAX, --choose_max CHOOSE_MAX
                        whether choose the atom that has the highest prob as focus atom
  --min_dist_inter_mol MIN_DIST_INTER_MOL
                        inter-molecular dist cutoff between protein and ligand.
  --bond_length_range BOND_LENGTH_RANGE
                        the range of bond length for mol generation.
  -mdb MAX_DOUBLE_IN_6RING, --max_double_in_6ring MAX_DOUBLE_IN_6RING
  --with_print WITH_PRINT
                        whether print SMILES in generative process
  --root_path ROOT_PATH
                        the root path for saving results
  --readme README, -rm README
                        description of this genrative task
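The help text does not say how the -at and -bt temperatures are applied internally. In flow-based generators, a common convention is that the temperature scales the standard deviation of the Gaussian latent prior, so lower values give samples closer to the mode and less diversity. The sketch below illustrates that convention only; it is an assumption about the general technique, not PocketFlow's code, and sample_latent is a hypothetical name.

```python
import random
import statistics


def sample_latent(dim, temperature=1.0, rng=random):
    """Draw a latent vector from N(0, temperature^2 * I) (illustrative only)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [rng.gauss(0.0, temperature) for _ in range(dim)]


rng = random.Random(0)  # fixed seed so the comparison is reproducible
cold = sample_latent(10000, temperature=0.5, rng=rng)
warm = sample_latent(10000, temperature=1.0, rng=rng)

# The empirical spread of the samples tracks the temperature.
print(statistics.pstdev(cold), statistics.pstdev(warm))
```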

Splitting Pocket

The pocket structure can be split from the protein structure based on the pose of the ligand:

from pocket_flow import SplitPocket, Protein, Ligand

pro = Protein('/path/to/protein.pdb')
lig = Ligand('/path/to/ligand.sdf')
dist_cutoff = 10
pocket_block, _ = SplitPocket._split_pocket_with_surface_atoms(pro, lig, dist_cutoff)
# write the pocket PDB block to disk and close the file afterwards
with open('/path/to/pocket.pdb', 'w') as f:
    f.write(pocket_block)
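To make the role of dist_cutoff concrete, the sketch below shows one common way to define a pocket by distance: keep every residue that has at least one atom within the cutoff of any ligand atom. This is an illustration under that assumption, not the implementation of SplitPocket; the name of _split_pocket_with_surface_atoms suggests additional handling of surface atoms, which is not reproduced here. The function residues_within_cutoff and the toy coordinates are hypothetical.

```python
import math


def residues_within_cutoff(protein_atoms, ligand_coords, dist_cutoff=10.0):
    """Return IDs of residues with at least one atom within dist_cutoff
    of any ligand atom, in order of first appearance.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples
    ligand_coords: iterable of (x, y, z) tuples
    """
    ligand_coords = list(ligand_coords)
    selected = []
    for res_id, xyz in protein_atoms:
        if res_id in selected:
            continue  # residue already kept because of an earlier atom
        if any(math.dist(xyz, lig) <= dist_cutoff for lig in ligand_coords):
            selected.append(res_id)
    return selected


# Toy coordinates, made up for illustration.
protein_atoms = [
    ("ALA1", (0.0, 0.0, 0.0)),
    ("ALA1", (1.5, 0.0, 0.0)),
    ("GLY2", (8.0, 0.0, 0.0)),
    ("SER3", (30.0, 0.0, 0.0)),
]
ligand_coords = [(2.0, 0.0, 0.0), (3.0, 1.0, 0.0)]

pocket_residues = residues_within_cutoff(protein_atoms, ligand_coords, dist_cutoff=10.0)
print(pocket_residues)
```

A smaller cutoff yields a tighter pocket: with the toy data above, lowering dist_cutoff to 5.0 keeps only the first residue.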

Dataset

The raw CrossDocked2020 dataset is large and needs about 50 GB of disk space. You can download the processed data from Pocket2Mol and then build the training dataset as follows:

from pocket_flow import CrossDocked2020

# take the last whitespace-separated field of each non-empty line
with open('data/unexcept_element_sample_new.csv') as f:
    unexpected_sample = [line.split()[-1] for line in f if line.strip()]
cs2020 = CrossDocked2020(
    './data/crossdocked_pocket10/',
    './data/crossdocked_pocket10/index.pkl',
    unexpected_sample=unexpected_sample
    )
cs2020.run(
    dataset_name='crossdocked_pocket10_processed_35Atoms.lmdb',
    max_ligand_atom=35,
    only_backbone=False,
    lmdb_path='./data/'
    )

The pretraining dataset of PocketFlow was selected from ZINC 3D. You can download ZINC 3D and then use make_pretrain_data.py to produce the pretraining dataset.