Molecular similarity based on fast estimation of maximum common substructure (MCS).
Let molMCS denote the MCS of two molecules mol1 and mol2. The similarity is then defined as
where n(mol) denotes the number of atoms in the molecule.
Noticing that rdkit.Chem.rdFMCS.FindMCS is slow in some cases, such as for large molecules or highly similar molecules, I use Morgan fingerprints and depth-first search (DFS) to quickly estimate the atoms in the MCS in three steps.
- Get the atom environment of each atom in mol1 and mol2 with MorganFingerprint, and collect the atoms whose environments occur in both molecules.
- Apply DFS to build substructures by connecting the atoms that share a common environment, and choose the substructure containing the most atoms as the MCS.
- When the MCS sizes obtained from mol1 and mol2 differ, pick the smaller one, because the same environment can occur on different atoms.
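
The second and third steps can be sketched in plain Python. This is a minimal illustration, not the code from this project: it assumes the first step has already produced, for each molecule, the set of atom indices whose environment also occurs in the other molecule, and it represents each molecule as a hypothetical adjacency dictionary instead of an RDKit molecule. The function names `largest_common_fragment` and `estimate_mcs_size` are made up for this example.

```python
from typing import Dict, List, Set


def largest_common_fragment(adjacency: Dict[int, List[int]], common: Set[int]) -> Set[int]:
    """Return the largest connected set of atoms drawn only from `common`.

    Uses an iterative depth-first search so that large molecules do not
    hit Python's recursion limit.
    """
    best: Set[int] = set()
    seen: Set[int] = set()
    for start in sorted(common):
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        fragment: Set[int] = set()
        while stack:
            atom = stack.pop()
            fragment.add(atom)
            for neighbor in adjacency.get(atom, []):
                # Only walk along bonds whose far end is also a common atom.
                if neighbor in common and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        if len(fragment) > len(best):
            best = fragment
    return best


def estimate_mcs_size(adj1, common1, adj2, common2) -> int:
    """Estimate the MCS size as the smaller of the two per-molecule fragments."""
    size1 = len(largest_common_fragment(adj1, common1))
    size2 = len(largest_common_fragment(adj2, common2))
    return min(size1, size2)


# Toy input (assumed, not real chemistry): a 5-atom chain in which
# atom 2 has no matching environment, and a 3-atom chain that matches fully.
adj1 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
common1 = {0, 1, 3, 4}
adj2 = {0: [1], 1: [0, 2], 2: [1]}
common2 = {0, 1, 2}

mcs_size = estimate_mcs_size(adj1, common1, adj2, common2)
```

In the toy input the unmatched atom splits the first chain into two 2-atom fragments, so the smaller estimate is taken even though the second molecule yields a 3-atom fragment.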
First, I take the products in the USPTO_50K dataset and apply CReM to generate 5 mutations for each molecule. These mutations are supposed to be "similar" to their parent molecules.
Next, I calculate the similarities between the parent molecules and their mutations and find the top 3 most similar mutations using
- Tanimoto Similarity
- rdkit MCS similarity
- fast MCS similarity
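
As a rough sketch of the ranking step, the snippet below scores each mutation against its parent and keeps the 3 best. It assumes fingerprints are given as sets of on-bit indices and uses the standard Tanimoto coefficient on those sets. The helper names `tanimoto` and `top_k_similar` are hypothetical, and any of the three similarity functions could be passed in as `similarity`.

```python
import heapq
from typing import Callable, List, Sequence, Set, Tuple


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def top_k_similar(
    parent: Set[int],
    mutations: Sequence[Set[int]],
    similarity: Callable[[Set[int], Set[int]], float],
    k: int = 3,
) -> List[Tuple[float, int]]:
    """Return (score, index) pairs for the k mutations most similar to the parent."""
    scored = [(similarity(parent, m), i) for i, m in enumerate(mutations)]
    return heapq.nlargest(k, scored)


# Made-up on-bit sets standing in for one parent and its 5 mutations.
parent = {1, 2, 3, 4}
mutations = [{1, 2, 3, 4, 5}, {1, 2}, {1, 2, 3}, {7, 8}, {1, 2, 3, 4}]
top3 = top_k_similar(parent, mutations, tanimoto)
```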
To reduce the experiment time, I randomly chose 200 molecules from the 50K molecules in USPTO_50K.
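
A reproducible way to draw such a subset might look like the following. The seed value, the function name `sample_molecules`, and the placeholder identifiers are assumptions for illustration; the original experiment may have sampled differently.

```python
import random
from typing import List, Sequence


def sample_molecules(molecules: Sequence[str], n: int = 200, seed: int = 0) -> List[str]:
    """Draw n distinct molecules without replacement, using a fixed seed."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(list(molecules), n)


# Placeholder identifiers standing in for the 50K product SMILES strings.
molecules = [f"mol_{i}" for i in range(50_000)]
subset = sample_molecules(molecules)
```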
Both RDKit_MCS and fast_MCS similarity give high similarity scores, whereas Tanimoto gives lower similarity scores between parents and their mutations. While RDKit_MCS takes about 60K times longer than Tanimoto, fast_MCS needs only around twice the time of Tanimoto.
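
Relative timings like these can be measured with a small harness along the following lines. It is only a sketch: `mean_runtime` and `relative_cost` are hypothetical helpers, and the two toy similarity functions are stand-ins, not the methods compared above.

```python
import time
from typing import Callable, Sequence, Tuple


def mean_runtime(func: Callable, pairs: Sequence[Tuple], repeats: int = 3) -> float:
    """Average wall-clock seconds per call of func over all pairs."""
    start = time.perf_counter()
    for _ in range(repeats):
        for a, b in pairs:
            func(a, b)
    return (time.perf_counter() - start) / (repeats * len(pairs))


def relative_cost(func: Callable, baseline: Callable, pairs: Sequence[Tuple]) -> float:
    """How many times slower func is than baseline on the same pairs."""
    return mean_runtime(func, pairs) / mean_runtime(baseline, pairs)


# Stand-in similarity functions on sets; the slow one just repeats the work.
def cheap_similarity(a, b):
    return len(a & b) / max(len(a | b), 1)


def costly_similarity(a, b):
    return sum(cheap_similarity(a, b) for _ in range(200)) / 200


pairs = [(set(range(i, i + 50)), set(range(i + 10, i + 60))) for i in range(100)]
ratio = relative_cost(costly_similarity, cheap_similarity, pairs)
```

Timing every method on the same molecule pairs keeps the ratio comparable across methods, although absolute numbers will vary by machine.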