Variational Autoencoder for Molecules

Primary language: Jupyter Notebook. License: MIT.

A variational autoencoder for molecules, implemented in TensorFlow.

Dependencies

  1. RDKit

conda install -c rdkit rdkit

  2. TensorFlow

CPU version

pip install tensorflow

GPU version

pip install tensorflow-gpu

Preprocessing

1. Data

The ChEMBL 24 database was used as the source of SMILES data.

SMILES strings are padded with spaces to max_len (default: 120), and strings longer than max_len are discarded. The remaining strings are labeled character by character, giving max_len integer labels per string.
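The padding and labeling described above can be sketched roughly as follows. This is an illustration only, not the code in preprocess.py: the function name `pad_and_label` and the choice to assign labels in sorted character order are assumptions.

```python
from typing import Dict, List, Tuple

MAX_LEN = 120  # default max_len stated above


def pad_and_label(
    smiles: List[str], max_len: int = MAX_LEN
) -> Tuple[List[List[int]], Dict[str, int]]:
    """Drop over-long strings, pad the rest with spaces, map characters to integers."""
    # Strings longer than max_len are discarded.
    kept = [s for s in smiles if len(s) <= max_len]
    # Right-pad every remaining string with spaces up to max_len.
    padded = [s.ljust(max_len) for s in kept]
    # Assumption: labels follow sorted character order. The space is always
    # included and sorts before the other printable ASCII characters, so it gets label 0.
    chars = sorted({" "} | set("".join(padded)))
    char_to_label = {c: i for i, c in enumerate(chars)}
    labels = [[char_to_label[c] for c in s] for s in padded]
    return labels, char_to_label


# Small max_len so the result is easy to inspect; the third string is discarded.
labels, char_to_label = pad_and_label(["CCO", "c1ccccc1", "C" * 200], max_len=10)
```

Each kept string becomes a list of exactly max_len integers, which is the shape a character-level model expects.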

2. preprocess.py

Does the following steps:

  1. Downloads chembl_24_1_chemreps.txt.gz
  2. Preprocesses the SMILES strings
  3. Saves the processed data as NumPy arrays

The NumPy arrays contain the training data, the testing data, and dictionaries for converting between characters and integer labels.
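One possible way to store the arrays and both dictionaries together is a single .npz archive, sketched below. The function names, the key names (`x_train`, `x_test`, `char_to_label`, `label_to_char`) and the file name are hypothetical; the actual layout written by preprocess.py may differ.

```python
import os
import tempfile

import numpy as np


def save_dataset(path, x_train, x_test, char_to_label):
    """Save label arrays and both lookup dictionaries into one .npz file."""
    label_to_char = {i: c for c, i in char_to_label.items()}
    np.savez(
        path,
        x_train=x_train,
        x_test=x_test,
        # Dictionaries are wrapped in 0-d object arrays, which NumPy pickles.
        char_to_label=np.array(char_to_label, dtype=object),
        label_to_char=np.array(label_to_char, dtype=object),
    )


def load_dataset(path):
    """Load the arrays and recover the dictionaries with .item()."""
    # allow_pickle=True is required to read the object arrays back.
    with np.load(path, allow_pickle=True) as data:
        return (
            data["x_train"],
            data["x_test"],
            data["char_to_label"].item(),
            data["label_to_char"].item(),
        )


# Round trip through a temporary directory (hypothetical file name).
vocab = {" ": 0, "C": 1, "O": 2}
train = np.array([[1, 1, 2, 0], [1, 2, 0, 0]])
test = np.array([[1, 0, 0, 0]])
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.npz")
    save_dataset(path, train, test, vocab)
    x_train, x_test, c2l, l2c = load_dataset(path)
```

Only load pickled arrays from files you created yourself, since unpickling runs arbitrary code.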

Training

1. Model

The model consists of a CNN encoder and a CuDNNGRU decoder, and is defined in vae.py.
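The contents of vae.py are not reproduced here. As a rough illustration of the step that sits between the encoder and the decoder in a standard variational autoencoder, the NumPy sketch below shows the reparameterized sampling and the KL term. It assumes the encoder outputs a mean `mu` and a log-variance `log_var` per molecule; the function names are hypothetical, and the repository's model may differ in detail.

```python
import numpy as np


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_divergence(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)


rng = np.random.default_rng(0)
mu = np.zeros((4, 8))        # batch of 4 molecules, 8 latent dimensions (arbitrary sizes)
log_var = np.zeros((4, 8))   # unit variance
z = sample_latent(mu, log_var, rng)
kl = kl_divergence(mu, log_var)  # zero when the posterior equals the prior
```

The training loss of a VAE typically adds this KL term to the reconstruction loss of the decoder.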

2. train.py

Does the following steps:

  1. Loads the preprocessed data
  2. Trains the model with fit_generator using DataGenerator
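To show the kind of object fit_generator consumes, here is a minimal plain-Python batch generator. It is not the repository's DataGenerator; the name `batch_generator` and the choice to yield the input as its own target (as an autoencoder would need) are assumptions.

```python
import numpy as np


def batch_generator(x, batch_size, rng):
    """Yield shuffled (inputs, targets) batches forever."""
    n = len(x)
    while True:
        # Reshuffle at the start of every pass over the data.
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = x[order[start:start + batch_size]]
            # Assumption: the autoencoder reconstructs its own input.
            yield batch, batch


x = np.arange(10).reshape(5, 2)
gen = batch_generator(x, batch_size=2, rng=np.random.default_rng(0))
first_epoch = [next(gen) for _ in range(3)]  # ceil(5 / 2) = 3 batches per pass
```

Because the generator never stops, the number of batches per epoch has to be given to the training call, here `ceil(len(x) / batch_size)`.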

Notebooks

The notebooks are meant to be used after training is done.

One notebook helps obtain variant structures for a given SMILES string.

A second notebook helps visualize the learned latent space, either as a plot or in TensorBoard.

TensorBoard visualization example:

https://raw.githubusercontent.com/YunjaeChoi/vaemols/master/doc/image/tensorboard.png

A third notebook helps retrieve the top_k most similar molecules, measured by Euclidean distance in latent space.
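The top_k lookup can be sketched in a few lines of NumPy. This assumes the latent vectors of all candidate molecules are already stacked in one array; the function name `top_k_similar` is hypothetical and the notebook's own code may differ.

```python
import numpy as np


def top_k_similar(query, latents, k):
    """Return indices and distances of the k latent vectors closest to query."""
    # Euclidean distance from the query to every row of latents.
    dists = np.linalg.norm(latents - query, axis=1)
    k = min(k, len(dists))
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]


latents = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0], [0.5, 0.0]])
idx, dists = top_k_similar(np.array([0.0, 0.0]), latents, k=2)
```

The returned indices can then be used to look up the corresponding SMILES strings. If the query molecule itself is in `latents`, it appears first with distance 0.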