/exmoset

Automating the generation of human readable descriptions of arbitrary subsets of molecular space.

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

EXplainable MOlecular SETs

Package to automate the identification of molecular similarity given an arbitrary set of molecules and associated functions to calculate the value of particular properties (label fingerprints).

Installation

The easiest way to install is using pip following the setup of a new conda environment with rdkit installed (rdkit does not play well with pip).

  1. conda create -n exmoset python=3.8
  2. conda activate exmoset
  3. conda install -c conda-forge rdkit
  4. pip install exmoset

API

The MolSpace Class handles the analysis of a given molecular set in accordance with the list of fingerprints provided. The molecules can be passed to Molspace in any format, with additional conversions specified by the mol_converters argument.

analysis = MolSpace(molecules,
                    fingerprints = fingerprints,
                    file="data/QM9_Data.csv",
                    mol_converters={"rd" : Chem.MolFromSmiles, "smiles" : str},
                    index_col="SMILES")

Fingerprints

Fingerprints are a standardized way for Molspace to calculate the properties for each molecule it is analysing. Its arguments determine the grammatical structure of the label that will be produced (property, noun and verb), and a function to calculate the property (calculator) along with what molecular format this function works on (mol_format). The grammatical structure of the resulting labels is a work in progress, and may lead to some poor results that require further processing.

def contains_C(mol):
      return 1 if C in mol else 0

contains_carbon = Fingerprint(property="Contains C",
                  verb="contain",
                  noun="Molecule",
                  label_type="binary",
                  calculator=contains_C,
                  mol_format="smiles")

Molecule Converters

The mol_converters argument provides the means to transform each molecule into alternate representations. The argument is a dictionary with the following structure {Identifier : Function_that_will_convert} that is expanded in the following way:

formats = {key : mol_converters[key](mol) for key in mol_converters.keys()} # Assigns each identifier to its assocaited representation by
self.Molecules.append(Molecule(mol, **formats)) # Unpacks the new formats as kwargs into the Molecule object

An example is provided below

mol_converters = {"rd" : Chem.MolFromSmiles, "smiles" : str} # Will convert molecules provided as smiles strings into Chem.rd objects from RDKit and maintain the SMILES in the dataset as strings.

Label Types

Binary

Binary labels indicate the presence of absence of a particular element, bond type, or molecular feature (such as aromaticity). Simplest to calculate and best behaved with respect to the entropy estimators. Uses a discrete entropy estimator.

Multiclass

Discrete labels where the value can be any integer. Examples include number of rings, number of atoms, or number of each type of bond. Uses a discrete entropy estimator.

Continuous

Continuous labels where the value can be any real number. Examples include electronic spatial extent, dipole moment, and free energy. Uses the continuous entropy estimator

References

The mathematical methods employed in this codebase are based on the following publications:

  • Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 66138.
  • Ross, B. C. Mutual Information between Discrete and Continuous Data Sets. PLOS ONE 2014, 9, 1–5.

Continuous entropy estimation is provided by Paul Broderson's entropy estimators package (https://github.com/paulbrodersen/entropy_estimators).