maxsmi: a guide to SMILES augmentation. Find the optimal SMILES augmentation for accurate molecular prediction.

Maxsmi: data augmentation for molecular property prediction using deep learning

Table of contents

  • Project description
  • Citation
  • Installation using conda
    • Prerequisites
    • How to install
  • How to use maxsmi
    • Examples
      • How to train and evaluate a model using augmentation
      • How to make predictions
  • Documentation
  • Repository structure and important files
  • Acknowledgments

Project description

SMILES augmentation for deep-learning-based molecular property and activity prediction.

Accurate prediction of molecular properties and activities is one of the main goals of computer-aided drug design, in which deep learning has become an important tool. Because neural networks are data-hungry and both physico-chemical and bioactivity data sets remain scarce, augmentation techniques have become a powerful aid for accurate prediction.

This repository provides the code to apply data augmentation by exploiting the fact that one compound can be represented by several SMILES (simplified molecular-input line-entry system) strings.

Augmentation strategies

  • No augmentation
  • Augmentation with duplication
  • Augmentation without duplication
  • Augmentation with reduced duplication
  • Augmentation with estimated maximum
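
The difference between the first augmentation strategies can be illustrated with a minimal pure-Python sketch. It is not the repository's implementation: the pool of equivalent ethanol SMILES and the helper names (`sample_smiles`, `augment_with_duplication`, `augment_without_duplication`) are hypothetical, and the sketch assumes that "with duplication" keeps every random draw while "without duplication" keeps each distinct string once. The reduced-duplication and estimated-maximum strategies are left out because their exact rules are not stated here.

```python
import random

# Hypothetical pool of equivalent SMILES strings for ethanol; a real
# workflow would generate random SMILES with a cheminformatics toolkit.
ETHANOL_SMILES = ["CCO", "OCC", "C(C)O", "C(O)C"]


def sample_smiles(pool, n, seed=0):
    """Draw n SMILES at random (with replacement) from a pool of equivalent strings."""
    rng = random.Random(seed)
    return [rng.choice(pool) for _ in range(n)]


def augment_with_duplication(pool, n, seed=0):
    """Assumed behavior: keep all n draws, including repeated strings."""
    return sample_smiles(pool, n, seed)


def augment_without_duplication(pool, n, seed=0):
    """Assumed behavior: keep each distinct string once, in first-seen order."""
    return list(dict.fromkeys(sample_smiles(pool, n, seed)))


with_dup = augment_with_duplication(ETHANOL_SMILES, n=10)
without_dup = augment_without_duplication(ETHANOL_SMILES, n=10)
print(len(with_dup), with_dup)
print(len(without_dup), without_dup)
```

With duplication, the augmented set always has the requested size; without duplication, its size is bounded by the number of distinct SMILES the compound admits.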

Data sets

  • Physico-chemical data from MoleculeNet, available as part of DeepChem
    • ESOL
    • FreeSolv
    • lipophilicity
  • Bioactivity data on the EGFR kinase, retrieved from Kinodata

Deep learning models

  • 1D convolutional neural network (CONV1D)
  • 2D convolutional neural network (CONV2D)
  • Recurrent neural network (RNN)

The results of our study show that data augmentation improves accuracy regardless of the deep learning model and the size of the data set. The best strategy leads to the Maxsmi models, which are available here for predictions on novel compounds for the provided data sets.

Citation

If you use maxsmi, don't forget to reference the work. The paper can be found at this link.

@article{kimber_2021_AILSCI,
  title = {Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning},
  author = {Talia B. Kimber and Maxime Gagnebin and Andrea Volkamer},
  journal = {Artificial Intelligence in the Life Sciences},
  volume = {1},
  pages = {100014},
  year = {2021},
  issn = {2667-3185},
  doi = {https://doi.org/10.1016/j.ailsci.2021.100014},
  url = {https://www.sciencedirect.com/science/article/pii/S2667318521000143}
}

Installation using conda

Prerequisites

Anaconda and Git should be installed. See Anaconda's website and Git's website for download.

How to install

  1. Clone the GitHub repository:
git clone https://github.com/volkamerlab/maxsmi.git
  2. Change directory:
cd maxsmi
  3. Create the conda environment:
conda env create -n maxsmi -f devtools/conda-envs/test_env.yaml
  4. Activate the environment:
conda activate maxsmi
  5. Install the maxsmi package:
pip install -e .

How to use maxsmi

Examples

How to train and evaluate a model using augmentation

To get an overview of all available options:

python maxsmi/full_workflow.py --help

To train a model with the ESOL data set, augmenting the training set 5 times and the test set 2 times, training for 5 epochs:

python maxsmi/full_workflow.py --task="ESOL" --aug-strategy-train="augmentation_without_duplication" --aug-nb-train=5 --aug-nb-test=2 --nb-epochs 5

To evaluate without ensemble learning, add the --eval-strategy=False flag as below:

Note: With ensemble learning, one prediction is computed per compound, whereas without ensemble learning, one prediction is computed per SMILES string.

python maxsmi/full_workflow.py --task="ESOL" --aug-strategy-train="augmentation_without_duplication" --aug-nb-train=5 --aug-nb-test=2 --nb-epochs 5 --eval-strategy=False
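
The distinction between per-SMILES and per-compound predictions can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the prediction values are made up, the helper `per_compound_predictions` is hypothetical, and the sketch assumes the per-compound value is the mean of the per-SMILES predictions, with the standard deviation reported as a measure of spread.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical per-SMILES predictions: (compound id, SMILES, predicted value).
per_smiles = [
    ("ethanol", "CCO", 1.0),
    ("ethanol", "OCC", 1.2),
    ("ethanol", "C(C)O", 0.8),
    ("propane", "CCC", -2.0),
]


def per_compound_predictions(rows):
    """Group per-SMILES predictions by compound and return (mean, std) per compound."""
    grouped = defaultdict(list)
    for compound, _smiles, value in rows:
        grouped[compound].append(value)
    # Assumption: aggregate by mean, with population std as the spread.
    return {c: (mean(v), pstdev(v)) for c, v in grouped.items()}


ensemble = per_compound_predictions(per_smiles)
for compound, (avg, spread) in ensemble.items():
    print(f"{compound}: mean={avg:.3f}, std={spread:.3f}")
```

Without ensemble learning, each row of `per_smiles` would be scored on its own; with it, the rows of one compound collapse into a single value.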

To train a model with all chosen arguments:

Note: This command trains for 250 epochs, which is also the default. Please allow time for the model to train.

python maxsmi/full_workflow.py --task="FreeSolv" --string-encoding="smiles" --aug-strategy-train="augmentation_with_duplication" --aug-strategy-test="augmentation_with_reduced_duplication" --aug-nb-train=5 --aug-nb-test=2 --ml-model="CONV1D" --eval-strategy=True --nb-epochs=250
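
To compare several augmentation numbers, the commands above can be assembled programmatically. The sketch below only builds and prints the argument lists; the helper `build_command` and the chosen values (1, 5, 10) are hypothetical, while the script path and flag names are the ones shown in the examples above.

```python
import shlex
import sys


def build_command(script, options):
    """Return an argument list: interpreter, script, then --flag=value pairs."""
    args = [sys.executable, script]
    for flag, value in options.items():
        args.append(f"--{flag}={value}")
    return args


# One command per (hypothetical) number of training augmentations.
commands = [
    build_command(
        "maxsmi/full_workflow.py",
        {
            "task": "ESOL",
            "aug-strategy-train": "augmentation_without_duplication",
            "aug-nb-train": nb_train,
            "aug-nb-test": 2,
            "nb-epochs": 5,
        },
    )
    for nb_train in (1, 5, 10)
]

for command in commands:
    # Print a shell-safe version of each command for inspection.
    print(shlex.join(command))
    # To launch training from the repository root, import subprocess
    # and call subprocess.run(command, check=True) here.
```

Passing an argument list rather than a single string avoids shell quoting issues when a value contains special characters.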

To train a model with early stopping (this command could take time to be executed):

python maxsmi/full_workflow_earlystopping.py --aug-nb-train=3 --aug-nb-test=2

How to make predictions

These predictions use the precalculated Maxsmi models (best performing models in the study).

To predict the affinity of a compound against the EGFR kinase, e.g. given by the SMILES CC1CC1, run:

python maxsmi/prediction_unlabeled_data.py --task="affinity" --smiles_prediction="CC1CC1"

To predict the lipophilicity of the drug semaxanib, run:

python maxsmi/prediction_unlabeled_data.py --task="lipophilicity" --smiles_prediction="O=C2C(\c1ccccc1N2)=C/c3c(cc([nH]3)C)C"
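
SMILES strings such as the one for semaxanib contain backslashes and slashes that encode double-bond geometry. When the prediction script is called from Python rather than typed in a shell, a raw string keeps the backslash intact. The following is a minimal sketch that only assembles and prints the command; the variable names are hypothetical, and the script path and flags are taken from the command above.

```python
import shlex
import sys

# Raw string: the backslash in the SMILES is a bond-direction symbol,
# not a Python escape sequence.
SEMAXANIB = r"O=C2C(\c1ccccc1N2)=C/c3c(cc([nH]3)C)C"

command = [
    sys.executable,
    "maxsmi/prediction_unlabeled_data.py",
    "--task=lipophilicity",
    f"--smiles_prediction={SEMAXANIB}",
]

# shlex.join adds the quoting a POSIX shell would need for this SMILES.
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be passed directly to `subprocess.run` from the repository root.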

Documentation

The maxsmi package documentation is available here.

Repository structure and important files

|-- LICENSE
|-- README.md
|-- devtools
|-- docs
|-- maxsmi
|   |-- output_                         <- Saved outputs for results analysis
|   |-- prediction_models               <- Weights for Maxsmi models
|   |-- pytorch_utils                   <- Utilities for PyTorch
|   |-- results_analysis                <- Notebooks for results analysis
|   |-- tests                           <- Unit tests
|   |-- utils                           <- Utilities for data, encodings, smiles
|   |-- augmentation_strategies.py      <- SMILES augmentation strategies
|   |-- full_workflow.py                <- Training and evaluation of deep learning models
|   |-- full_workflow_earlystopping.py  <- Training using early stopping
|   |-- prediction_unlabeled_data.py    <- Maxsmi models available for user prediction

Acknowledgments

Project based on the Computational Molecular Science Python Cookiecutter version 1.4.

Documentation and packaging: A special thank you to dominiquesydow for sharing her valuable knowledge with patience and kindness.

Copyright

Copyright (c) 2020, Talia B. Kimber at VolkamerLab.