In this repository, we have designed and collected various Python notebooks to illustrate some of the concepts and best practices for applying unsupervised learning methods to molecular data.
To run these notebooks, you will need to use a kernel with the following libraries installed:
- RDKit
- Pandas
- ipykernel
You can create a new environment in the terminal via the following sequence of commands:
conda create -n myenv python=3.9
conda activate myenv
pip install pandas ipykernel rdkit-pypi
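Then, update the Python kernel you are using to run each notebook: go to Kernel > Change kernel > Python (myenv). If Python (myenv) is not listed, register the environment as a kernel first by running python -m ipykernel install --user --name myenv --display-name "Python (myenv)" with the environment activated.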
Additional installation instructions for specific libraries are given within each notebook.
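To confirm that the selected kernel can see the required libraries, a quick check such as the following can be run in the first cell of a notebook. This is only a minimal sketch: the helper name missing_packages is hypothetical, and the import names are assumed to be rdkit, pandas, and ipykernel.

```python
from importlib.util import find_spec

# Import names of the libraries listed above (RDKit is imported as "rdkit").
REQUIRED = ["rdkit", "pandas", "ipykernel"]


def missing_packages(names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, the notebook is most likely running on a different kernel than the environment created above.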
There are three main data sets used in the Python notebooks presented here.
ZINC-250k is a subset of ZINC, a free database of commercially available compounds for virtual screening. The ZINC-250k subset can be downloaded from Kaggle; each compound is annotated with its partition coefficient (logP), quantitative estimate of drug-likeness (QED), and synthetic accessibility score (SAS). A 1,000-compound subset of ZINC-250k is provided in the data directory.
This subset is used to explore the various ways we can represent molecules computationally.
The QM7 data set is also used for some of the walk-through examples that rely on molecular structural information. A preprocessed copy of the data set is provided in data/qm7.xyz.
It includes SMILES strings, 3D coordinates, and various quantum properties.
The original data is available on this page.
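For readers who want to inspect an XYZ file outside the notebooks, the sketch below parses the generic multi-molecule XYZ layout (atom count, comment line, then one line per atom). It assumes data/qm7.xyz follows that layout. How the SMILES strings and quantum properties are encoded in the file is not described here, so the comment line is returned unparsed. The function name read_xyz_frames and the water example are illustrative only.

```python
from io import StringIO


def read_xyz_frames(handle):
    """Yield (comment, symbols, coords) for each molecule in an XYZ-style stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between molecules
        n_atoms = int(header.split()[0])
        comment = handle.readline().rstrip("\n")
        symbols, coords = [], []
        for _ in range(n_atoms):
            fields = handle.readline().split()
            symbols.append(fields[0])
            coords.append(tuple(float(x) for x in fields[1:4]))
        yield comment, symbols, coords


# Illustrative input only: an approximate water geometry, not a QM7 record.
example = StringIO(
    "3\n"
    "water (illustrative)\n"
    "O 0.000 0.000 0.117\n"
    "H 0.000 0.757 -0.469\n"
    "H 0.000 -0.757 -0.469\n"
)

for comment, symbols, coords in read_xyz_frames(example):
    print(comment, symbols, len(coords))
```

To read the provided file instead, pass an open file handle, for example open("data/qm7.xyz"), in place of the in-memory example.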
One notebook applies clustering to molecular dynamics data, using a 10,000-frame subset of the aspirin trajectory from MD17. The full aspirin trajectory, along with the other MD17 trajectories, can be downloaded from www.sgdml.org; a copy of the subset used here is provided in the data directory.
This directory contains four notebooks touching on different aspects of molecular representations. These are:
This directory contains one notebook illustrating concepts and best practices for dimensionality reduction with molecular data.
This directory contains two notebooks illustrating different ways to use clustering on molecular data:
Finally, this directory contains one notebook demonstrating how to construct a variational autoencoder.
This work was initiated at the CECAM workshop "Machine-learned potentials in molecular simulation: best practices and tutorials", held in Vienna in July 2023.
Since then, the following authors have contributed to this repo and the accompanying article: