De novo molecule design with chemical language models

In this repository, you will find a hands-on tutorial to generate focused libraries using RNN-based chemical language models.

This code serves as a support to the protocol chapter: Grisoni F., Schneider G. (2022) De Novo Molecular Design with Chemical Language Models. In: Heifetz A. (eds) Artificial Intelligence in Drug Design. Methods in Molecular Biology, vol 2390. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1787-8_9

The code for the following two methods is provided:

  • Bidirectional Molecule Design by Alternate Learning (BIMODAL), designed for SMILES generation – see Grisoni et al. 2020.
  • Forward RNN, i.e., a "classical" unidirectional RNN for SMILES generation.

In addition to the method code, several pre-trained models are included.

Note! This repository contains the code for the hands-on chapter and is intended for teaching purposes only.

To use the most up-to-date versions of the methods, have a look at the dedicated repositories, such as https://github.com/ETHmodlab/BIMODAL (see also the Advanced functions section below).

Happy coding!

Table of Contents

  1. Getting started
  2. Using the code
    1. Provided Jupyter notebook
    2. Sampling from a pre-trained model
    3. Fine-tuning a model
    4. Preprocessing
  3. Advanced functions
  4. Authors
  5. License
  6. How to cite

Getting started

This repository can be cloned with the following command:

git clone https://github.com/ETHmodlab/de_novo_design_RNN

To install the necessary packages to run the code, we recommend using conda. Once conda is installed, you can create the virtual environment as follows:

cd path/to/repository/
conda env create -f environment.yml

To activate the dedicated environment:

conda activate de_novo

You are now ready to use the code!

Using the code

Provided Jupyter notebook

In this repository, you can find a Jupyter notebook that will help you get started with using the code. We recommend having a look at the notebook first.

To use the provided notebook, move to the “example” folder and launch the Jupyter Notebook application, as follows:

cd example
jupyter notebook

A webpage will open, showing the content of the “example” folder. Double-clicking on the file “de_novo_design_pipeline.ipynb” opens the notebook.

Each cell of the provided notebook can be executed to visualize and reproduce the results of this tutorial. Below, you will also find additional details on more advanced settings.

Sampling from a pre-trained model

In this repository, we provide 22 pre-trained models that you can use for sampling (stored in evaluation/). These models were trained for 10 epochs on a set of 271,914 bioactive molecules from ChEMBL22 (Kd/Ki/IC50/EC50 < 1 μM).

To sample SMILES, you can create a new file in model/ and use the Sampler class. For example, to sample from the pre-trained BIMODAL model with 512 units:

from sample import Sampler
experiment_name = 'BIMODAL_fixed_512'
s = Sampler(experiment_name)
s.sample(N=100, stor_dir='../evaluation', T=0.7, fold=[1], epoch=[9], valid=True, novel=True, unique=True, write_csv=True)

Parameters:

  • experiment_name (str): name of the experiment with the pre-trained model you want to sample from (pre-trained models can be found in evaluation/)
  • stor_dir (str): directory where the models are stored. The sampled SMILES will also be saved there (if write_csv=True)
  • N (int): number of SMILES to sample
  • T (float): sampling temperature
  • fold (list of int): fold(s) of the model to use for sampling
  • epoch (list of int): epoch(s) of the model to use for sampling
  • valid (bool): if set to True, only valid SMILES are accepted (increases the sampling time)
  • novel (bool): if set to True, only SMILES that are novel with respect to the training set are accepted (increases the sampling time)
  • unique (bool): if set to True, only unique SMILES are accepted (increases the sampling time)
  • write_csv (bool): if set to True, a .csv file with the generated SMILES is exported to the specified directory.

Notes:

  • For the pre-trained models, only fold=[1] and epoch=[9] are available.
  • The list of available models and their descriptions can be found in evaluation/model_names.md.
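
As a small illustration of how these parameters can be combined, the sketch below samples a library at a few different temperatures using the Sampler class shown above (assuming the script is placed in model/, as before; depending on the output naming convention, you may want to move or rename the generated .csv between calls so the batches are not overwritten):

# Minimal sketch: sample small libraries at several temperatures.
# Assumes this script lives in model/ (like the example above) and that the
# pre-trained 'BIMODAL_fixed_512' model is available in evaluation/.
from sample import Sampler

s = Sampler('BIMODAL_fixed_512')

for temperature in (0.5, 0.7, 1.0):
    # Lower temperatures give more conservative SMILES, higher ones more diverse.
    s.sample(N=50, stor_dir='../evaluation', T=temperature,
             fold=[1], epoch=[9],            # only fold 1 / epoch 9 are provided
             valid=True, novel=True, unique=True,
             write_csv=True)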

Fine-tuning a model

Fine-tuning requires a pre-trained model and a parameter file (.ini). Examples of the parameter files (BIMODAL and ForwardRNN) are provided in experiments/.

The fine-tuning set needs to be pre-processed first (see the Preprocessing section below).

You can start the fine-tuning procedure with model/main_fine_tuner.py. The parameter (.ini) file defines the following settings:

| Section | Parameter | Description | Comments |
| --- | --- | --- | --- |
| Model | model | Model type | ForwardRNN, BIMODAL |
| | hidden_units | Number of hidden units | Suggested value: 256 for ForwardRNN; 128 for BIMODAL |
| Data | data | Name of the data file | Has to be located in data/ |
| | encoding_size | Number of different SMILES tokens | 55 |
| | molecular_size | Length of the string with padding | See preprocessing |
| Training | epochs | Number of epochs | Suggested value: 10 |
| | learning_rate | Learning rate | Suggested value: 0.001 |
| | batch_size | Batch size | Suggested value: 128 |
| Evaluation | samples | Number of SMILES generated after each epoch | |
| | temp | Sampling temperature | Suggested value: 0.7 |
| | starting_token | Starting token for sampling | G |
| Fine-Tuning | start_model | Name of the pre-trained model to be used for fine-tuning | |
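
For illustration only, the sketch below builds such a parameter file with Python's configparser, mirroring the fields in the table. The section and key names, the file name, and the example values are assumptions; in practice, copy one of the templates in experiments/ and edit it, since the shipped .ini files are the authoritative reference.

# Hypothetical sketch of a fine-tuning parameter (.ini) file, written with configparser.
# Section/key names only mirror the table above; check the templates in experiments/
# for the exact layout expected by the code.
import configparser

config = configparser.ConfigParser()
config['MODEL'] = {'model': 'BIMODAL',            # ForwardRNN or BIMODAL
                   'hidden_units': '128'}         # suggested: 256 (ForwardRNN) / 128 (BIMODAL)
config['DATA'] = {'data': 'my_finetuning_set',    # hypothetical pre-processed file in data/
                  'encoding_size': '55',
                  'molecular_size': '152'}        # example value; use the padded length from preprocessing
config['TRAINING'] = {'epochs': '10',
                      'learning_rate': '0.001',
                      'batch_size': '128'}
config['EVALUATION'] = {'samples': '100',         # SMILES generated after each epoch
                        'temp': '0.7',
                        'starting_token': 'G'}
config['FINETUNING'] = {'start_model': 'BIMODAL_fixed_512'}  # pre-trained model to start from

with open('../experiments/my_finetuning_experiment.ini', 'w') as f:
    config.write(f)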

To fine-tune a model, you can run:

t = FineTuner(experiment_name = 'BIMODAL_random_512_FineTuning_template')
t.fine_tuning(stor_dir='../evaluation/', restart=False)

Parameters:

  • experiment_name (str): name of the parameter file (.ini)
  • stor_dir (str): directory where the outputs are stored
  • restart (bool): if set to True, training automatically restarts from the saved models (e.g., to be used if fine-tuning was interrupted before completion)

Note:

  • The batch size should not exceed the number of SMILES in your fine-tuning file after data augmentation (for example, 50 fine-tuning SMILES with 5-fold augmentation yield about 250 training strings, so batch_size should be at most 250).
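
To connect the two steps, a rough end-to-end sketch could look as follows. It assumes that the FineTuner class can be imported as in model/main_fine_tuner.py (the module name fine_tuner below is an assumption; check that file for the actual import) and that the fold/epoch values passed to the sampler match your fine-tuning settings:

# Rough sketch: fine-tune on your data, then sample from the fine-tuned model.
# The module name 'fine_tuner' is an assumption; see model/main_fine_tuner.py
# for the import actually used by the repository.
from fine_tuner import FineTuner
from sample import Sampler

experiment_name = 'BIMODAL_random_512_FineTuning_template'

# Fine-tune the pre-trained model specified in the .ini file.
t = FineTuner(experiment_name=experiment_name)
t.fine_tuning(stor_dir='../evaluation/', restart=False)

# Sample from the last fine-tuning epoch (here assuming 10 epochs, indexed from 0).
s = Sampler(experiment_name)
s.sample(N=100, stor_dir='../evaluation', T=0.7, fold=[1], epoch=[9],
         valid=True, novel=True, unique=True, write_csv=True)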

Preprocessing

Data can be processed by using preprocessing/main_preprocessor.py:

from main_preprocessor import preprocess_data
preprocess_data(filename_in='../data/chembl_smiles', model_type='BIMODAL', starting_point='fixed', augmentation=1)

Parameters:

  • filename_in (str): name of the file containing the SMILES strings (.csv or .tar.xz)
  • model_type (str): name of the chosen generative method
  • starting_point (str): starting point type ('fixed' or 'random')
  • augmentation (int): number of augmentation folds [default: 1]

Notes:

  • In preprocessing/main_preprocessor.py, you will find information on advanced pre-processing options (e.g., stereochemistry, canonicalization, etc.).
  • Please note that the pre-processed data have to be stored in data/.
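
As a rough sketch of how a custom fine-tuning set could be prepared, the example below writes a few molecules to a hypothetical file in data/ and pre-processes it. It assumes the input is a plain .csv with one SMILES string per line and that the file name is passed without extension, as in the example above; check the files shipped in data/ to confirm the expected format.

# Rough sketch: create a hypothetical fine-tuning set and pre-process it.
# Run from within preprocessing/, like the example above. The assumed input
# format (one SMILES per line, no header) should be checked against data/.
from main_preprocessor import preprocess_data

smiles = ['CC(=O)Oc1ccccc1C(=O)O',              # aspirin
          'CN1C=NC2=C1C(=O)N(C)C(=O)N2C']       # caffeine

with open('../data/my_finetuning_set.csv', 'w') as f:
    f.write('\n'.join(smiles) + '\n')

preprocess_data(filename_in='../data/my_finetuning_set',   # extension omitted, as above
                model_type='BIMODAL',
                starting_point='fixed',
                augmentation=5)                 # augment small sets before fine-tuning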

Advanced functions

If you want to personalize the pre-training or use advanced settings, please refer to the following repo: https://github.com/ETHmodlab/BIMODAL

Authors

Authors of the provided code (as in this repo)

Author of this tutorial

See also the list of contributors who participated in this project.

License

This code is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

If you use this code (or parts thereof), please cite it as:

@article{grisoni2020,
  title     = {Bidirectional Molecule Generation with Recurrent Neural Networks},
  author    = {Grisoni, Francesca and Moret, Michael and Lingwood, Robin and Schneider, Gisbert},
  journal   = {Journal of Chemical Information and Modeling},
  volume    = {60},
  number    = {3},
  pages     = {1175--1183},
  year      = {2020},
  doi       = {10.1021/acs.jcim.9b00943},
  url       = {https://pubs.acs.org/doi/10.1021/acs.jcim.9b00943},
  publisher = {ACS Publications}
}

@incollection{grisoni2021,
  author    = {Grisoni, Francesca and Schneider, Gisbert},
  title     = {De novo Molecular Design with Chemical Language Models},
  booktitle = {Artificial Intelligence in Drug Design},
  editor    = {Heifetz, Alexander},
  series    = {Methods in Molecular Biology},
  volume    = {2390},
  pages     = {207--232},
  publisher = {Humana},
  address   = {New York, NY},
  year      = {2022},
  doi       = {10.1007/978-1-0716-1787-8_9}
}