- Zika virus Drug Design using Generative RNN-LSTM and Proteomics
Zika virus (ZIKV) was first reported in nonhuman primates in the Zika Forest of Uganda in 1947. ZIKV and dengue virus (DENV) are closely related flaviviruses that are transmitted by the same mosquito vector, Aedes aegypti, and have overlapping geographical distributions. Most ZIKV infections are asymptomatic; symptomatic cases trigger an immune response similar to that of DENV, with symptoms including fever and body pain. The best-known consequences of ZIKV infection occur in pregnant women, where infection poses a significant risk to the developing embryo, including microcephaly and other adverse outcomes.
Proteomics is the large-scale study of proteomes. A proteome is the set of proteins produced in an organism, system, or biological context. The proteome varies with time and with the distinct requirements, or stresses, that a cell or organism undergoes, and proteomics enables the identification of ever-increasing numbers of these proteins.
Drug discovery and development pipelines are long, complex and depend on numerous factors. Machine learning (ML) approaches provide a set of tools that can improve discovery and decision making for well-specified questions with abundant, high-quality data. Opportunities to apply ML occur in all stages of drug discovery. Examples include target validation, identification of prognostic biomarkers and analysis of digital pathology data in clinical trials.
SMILES (simplified molecular-input line-entry system) is a line notation for describing the structure of chemical species using short ASCII strings. Most molecule editors can import SMILES strings and convert them back into two-dimensional drawings or three-dimensional models of the molecules.
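To make the notation concrete, here is a minimal sketch that splits a SMILES string into tokens (atoms, bonds, branches, ring closures), which is roughly the unit a character-level sequence model sees. The regular expression and the function name `tokenize_smiles` are illustrative assumptions, not the tokenizer used in this project.

```python
import re

# Illustrative pattern (an assumption, not this project's tokenizer):
# bracket atoms first, then two-letter halogens, then two-digit ring
# closures, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into atom, bond, branch and ring-closure tokens."""
    return SMILES_TOKEN.findall(smiles)


# Acetyl chloride: "Cl" stays one token instead of being split into "C" + "l".
print(tokenize_smiles("CC(=O)Cl"))
# Nitromethane in charge-separated form: bracket atoms stay intact.
print(tokenize_smiles("C[N+](=O)[O-]"))
```

Joining the tokens back together reproduces the input string, so the split loses no information.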
The whole pipeline of this project looks like this:
Proteomic analysis of plasma from healthy and ZIKV-infected humans identified proteins whose expression levels changed significantly.
Differentially expressed Proteins
The drug datasets used in this project come from two databases: MOSES and ChEMBL. Together, these two datasets contain about 2.5 million SMILES strings.
The preprocessing steps include removing duplicates, salts, stereochemical information, nucleic acids and long peptides.
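Some of these cleaning steps can be sketched at the string level. The following is a rough illustration under stated assumptions, not the project's actual preprocessing: salts are handled by keeping the longest `.`-separated fragment (a heuristic), stereochemistry is dropped by deleting the `@`, `/` and `\` markers, and the hypothetical `max_len` cutoff only stands in crudely for the removal of long peptides and nucleic acids. Duplicates are detected by exact string match, so two different spellings of the same molecule would both survive; a real pipeline would likely canonicalize the SMILES first, for example with RDKit.

```python
def strip_salts(smiles):
    """Keep the largest '.'-separated fragment (heuristic: longest string)."""
    return max(smiles.split("."), key=len)


def strip_stereo(smiles):
    """Drop chirality (@) and double-bond direction (/ and backslash) markers."""
    return smiles.replace("@", "").replace("/", "").replace("\\", "")


def clean_smiles(raw, max_len=100):
    """Return cleaned, de-duplicated SMILES in first-seen order.

    max_len is a hypothetical cutoff used here as a crude proxy for
    filtering out very large molecules such as long peptides.
    """
    seen = set()
    cleaned = []
    for smi in raw:
        smi = strip_stereo(strip_salts(smi.strip()))
        if not smi or len(smi) > max_len or smi in seen:
            continue
        seen.add(smi)
        cleaned.append(smi)
    return cleaned


raw = [
    "CC(=O)[O-].[Na+]",   # sodium acetate: the counter-ion is dropped
    "F/C=C/F",            # E isomer
    "F\\C=C/F",           # Z isomer: identical once stereo is removed
    "C[C@H](N)C(=O)O",    # one alanine enantiomer
    "C[C@@H](N)C(=O)O",   # the other enantiomer: duplicate after cleaning
]
print(clean_smiles(raw))
```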
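from copy import copy

# The names below come from the LSTM_Chem package; the module paths
# and the config file location are assumed from that repository's layout.
from lstm_chem.utils.config import process_config
from lstm_chem.utils.dirs import create_dirs
from lstm_chem.data_loader import DataLoader
from lstm_chem.model import LSTMChem
from lstm_chem.trainer import LSTMChemTrainer

CONFIG_FILE = './base_config.json'


def main():
    config = process_config(CONFIG_FILE)

    # Create the experiment directories.
    create_dirs(
        [config.exp_dir, config.tensorboard_log_dir, config.checkpoint_dir])

    # Create the data generators; the validation loader is a shallow copy
    # of the training loader switched to the validation split.
    train_dl = DataLoader(config, data_type='train')
    valid_dl = copy(train_dl)
    valid_dl.data_type = 'valid'

    # Create the model.
    modeler = LSTMChem(config, session='train')

    # Create the trainer and start training the model.
    trainer = LSTMChemTrainer(modeler, train_dl, valid_dl)
    trainer.train()


if __name__ == '__main__':
    main()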
Experimentally validated anti-ZIKV drugs were collected from the literature, such as:
Niclosamide OC1=C(C=C(Cl)C=C1)C(=O)NC1=C(Cl)C=C(C=C1)[N+]([O-])=O
Sofosbuvir CC(C)OC(=O)C(C)NP(=O)(OCC1C(C(C(O1)N2C=CC(=O)NC2=O)(C)F)O)OC3=CC=CC=C3
These drugs are added to the dataset used for fine-tuning.
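# Module path assumed from the LSTM_Chem repository layout.
from lstm_chem.finetuner import LSTMChemFinetuner

# Reload the model in fine-tuning mode and train on the anti-ZIKV set.
modeler = LSTMChem(config, session='finetune')
finetune_dl = DataLoader(config, data_type='finetune')
finetuner = LSTMChemFinetuner(modeler, finetune_dl)
finetuner.finetune()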
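The Python libraries RDKit and Meeko are used to run batch docking with AutoDock Vina, as in the following code. RDKit converts a SMILES string into a molecule with embedded 3D coordinates, Meeko prepares that molecule as a PDBQT ligand, and AutoDock Vina performs the docking with the parameters supplied by the user.
import rdkit.Chem.AllChem  # also loads rdkit.Chem
import meeko
import vina

# Build a hydrogen-complete ligand with 3D coordinates from one generated SMILES string.
lig = rdkit.Chem.MolFromSmiles(fineTuned_smiles)
protonated_lig = rdkit.Chem.AddHs(lig)
rdkit.Chem.AllChem.EmbedMolecule(protonated_lig)

# Convert the ligand to PDBQT format with Meeko.
meeko_prep = meeko.MoleculePreparation()
meeko_prep.prepare(protonated_lig)
lig_pdbqt = meeko_prep.write_pdbqt_string()

# Dock the ligand into the receptor inside the given search box.
v = vina.Vina(sf_name='vina', verbosity=0)
v.set_receptor('target_protein.pdbqt')
v.set_ligand_from_string(lig_pdbqt)
v.compute_vina_maps(center=[-2.029, -53.903, 18.744], box_size=[60, 60, 60])
v.dock(exhaustiveness=200, n_poses=5)
output_pdbqt = v.poses(n_poses=5)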
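When docking many generated molecules in a batch, the poses returned for each ligand need to be reduced to a score so that candidates can be compared. A minimal sketch, assuming each pose string contains the `REMARK VINA RESULT:` lines that AutoDock Vina writes into its PDBQT output (affinity in kcal/mol as the first number); the helper names `vina_affinities` and `rank_ligands` and the sample text are hypothetical.

```python
def vina_affinities(pdbqt_text):
    """Return the affinity (kcal/mol) of each pose in a Vina PDBQT string."""
    scores = []
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            # Fields: REMARK, VINA, RESULT:, affinity, RMSD l.b., RMSD u.b.
            scores.append(float(line.split()[3]))
    return scores


def rank_ligands(results):
    """Sort {ligand_name: pdbqt_text} by best (most negative) affinity."""
    best = {}
    for name, text in results.items():
        scores = vina_affinities(text)
        if scores:  # skip ligands for which no pose was reported
            best[name] = min(scores)
    return sorted(best.items(), key=lambda item: item[1])


# Made-up pose headers, for illustration only.
sample = {
    "ligand_a": "MODEL 1\nREMARK VINA RESULT:    -7.5      0.000      0.000\nENDMDL\n"
                "MODEL 2\nREMARK VINA RESULT:    -7.1      1.812      2.404\nENDMDL\n",
    "ligand_b": "MODEL 1\nREMARK VINA RESULT:    -9.2      0.000      0.000\nENDMDL\n",
}
print(rank_ligands(sample))
```

In a batch run, the string returned by `v.poses(...)` for each ligand could be stored under that ligand's name and passed to `rank_ligands`.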
- Zika Virus: https://www.who.int/news-room/fact-sheets/detail/zika-virus
- Introduction to LSTM: Understanding LSTM -- a tutorial into Long Short-Term Memory Recurrent Neural Networks
- Introduction to LSTM_chem: Generative Recurrent Networks for De Novo Drug Design
- AutoDock: https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/autodock
- RDKit: https://www.rdkit.org/docs/GettingStartedInPython.html
- Author: Wei Zhang
- Email: zwmc@hotmail.com
- Github: https://github.com/vveizhang
- Linkedin: https://www.linkedin.com/in/wei-z-76253523/