Assigning the Origin of Microbial Natural Products by Chemical Space Map and Machine Learning

The Natural Product Atlas MAP4 TMAP colored by fungal (in magenta) or bacterial (in green) origin.

Repository contents:

a first JupyterLab notebook containing the code to reproduce this work: NPAtlas.ipynb;
the December 2019 version of the Natural Product Atlas (the latest version as of 22/07/2020): np_atlas_2019_12.tsv;
the k-NN and SVM classifiers that distinguish natural products of fungal origin from natural products of bacterial origin: NN_NPAtlas.pkl and SVM_NPAtlas.pkl;
a second JupyterLab notebook that shows a few applications of the classifiers: Classifier_test.ipynb;
a bash script to run the NPAtlas notebook: run_NPAtlas.sh.

NPAtlas.ipynb Jupyter Notebook Description:

1. Properties Calculation

The December 2019 version of the Natural Product Atlas was downloaded. The molecular weight (MW) and the compound origin were read from the downloaded data. The numbers of hydrogen bond acceptors (HBA) and donors (HBD), cLogP, the topological polar surface area (TPSA), and the fraction of sp3 carbons were calculated using RDKit. The boiling point was calculated as the Joback boiling temperature using the open source code of JRgui. The MAP4 fingerprint was calculated in 1024 dimensions. Molecules that violated more than one Lipinski rule were labelled as non-Lipinski. Glycosylated and/or peptidic structures were identified using the Daylight SMARTS language.
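The non-Lipinski labelling can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes the standard rule-of-five thresholds (MW at most 500 Da, at most 5 HBD, at most 10 HBA, logP at most 5) and takes precomputed descriptor values as plain numbers. The function names are hypothetical.

```python
def count_lipinski_violations(mw, hbd, hba, logp):
    """Count violated rule-of-five criteria for one molecule.

    Thresholds are the standard rule-of-five values (an assumption;
    the notebook may define them differently).
    """
    rules = (
        mw > 500,    # molecular weight above 500 Da
        hbd > 5,     # more than 5 hydrogen bond donors
        hba > 10,    # more than 10 hydrogen bond acceptors
        logp > 5,    # calculated logP above 5
    )
    return sum(rules)


def is_non_lipinski(mw, hbd, hba, logp):
    """Label a molecule non-Lipinski when it violates more than one rule."""
    return count_lipinski_violations(mw, hbd, hba, logp) > 1
```

A molecule with a single violation (for example, only MW above 500) is still treated as Lipinski-compliant under this definition.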

2. SVM and k-NN Classifiers

The dataset was assigned to the training or the test set with a 50% random split. A scikit-learn SVM classifier and a k-NN classifier, each with a custom kernel, were optimized using the training set to maximize the ROC AUC of the test set, using the MAP4 fingerprint as input.
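A 50% random split can be sketched with the standard library alone. The function name and the fixed seed are assumptions made for reproducibility of the example; the notebook may use a different splitting utility.

```python
import random


def random_half_split(records, seed=42):
    """Shuffle a copy of the records and cut it into two halves.

    Returns (training_set, test_set). The seed is a hypothetical
    default so that the split can be reproduced.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]
```

With an odd number of records, the extra record ends up in the test set.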

Two additional SVM and two additional k-NN classifiers were implemented as described above, but each was trained on a specific subset of the training set. One SVM and one k-NN classifier were trained only on the training-set structures containing a dipeptide moiety; the other SVM and k-NN classifier were trained only on the training-set structures containing an acetal moiety.

As a baseline for our analysis, we used an SVM trained on physicochemical properties: fraction of sp3 carbon, HBA, HBD, ALogP, TPSA, MW, and calculated boiling point.

For the evaluation of the classifiers we considered the class bacterium to be the positive class and the class fungus to be the negative one.
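To make the role of the positive class concrete, the sketch below computes the ROC AUC in its rank-based form: the probability that a randomly chosen bacterial compound receives a higher score than a randomly chosen fungal one, with ties counted as one half. The function name, the label strings, and the idea that scores express "bacterium-likeness" are assumptions for illustration; the notebook may rely on a library routine instead.

```python
def roc_auc(labels, scores, positive="bacterium"):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    `scores` are assumed to be higher for compounds predicted to be
    of the positive class. Ties contribute 0.5.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == positive]
    neg = [s for lab, s in zip(labels, scores) if lab != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

This pairwise formulation is quadratic in the number of compounds, so it is meant for explanation rather than for scoring the full dataset.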

Please note that when using MAP4 for machine learning, a custom kernel (or a custom loss function) is needed because the similarity/dissimilarity between two MinHashed fingerprints cannot be assessed with "standard" Jaccard, Manhattan, or Cosine functions. In fact, due to MinHashing, the order of the features matters and the distance cannot be calculated "feature-wise". There is a well-written blog post that explains it.

Custom Kernels

The custom kernel implemented for the SVM models calculates the similarity matrix between two lists of MinHashed fingerprints. The similarity of fingerprint a and fingerprint b is calculated by (1) counting the elements that have the same value at the same index in a and b, and (2) dividing that count by the number of elements of fingerprint a.
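The two steps can be sketched in plain Python as follows. This is an illustrative version under stated assumptions, not the notebook's implementation: fingerprints are represented as equal-length lists of integers, and the function names are hypothetical.

```python
def minhash_similarity(a, b):
    """Fraction of positions at which fingerprints a and b hold the same value."""
    # Step 1: count elements with the same value at the same index.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    # Step 2: divide by the number of elements of fingerprint a.
    return matches / len(a)


def minhash_kernel(fps_a, fps_b):
    """Similarity matrix between two lists of MinHashed fingerprints.

    Row i, column j holds the similarity of fps_a[i] and fps_b[j].
    """
    return [[minhash_similarity(a, b) for b in fps_b] for a in fps_a]
```

scikit-learn's SVC accepts either a callable kernel that returns such a matrix or kernel="precomputed"; in both cases the matrix would need to be a NumPy array rather than a nested list, so a vectorized variant is likely preferable in practice.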

The custom kernel implemented for the k-NN classifiers calculates the distance between two MinHashed fingerprints as one minus the similarity between the two fingerprints calculated as in the SVM custom kernel.
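A minimal sketch of this distance, together with a brute-force neighbor lookup, is shown below. It assumes fingerprints are equal-length lists of integers; the function names are hypothetical and the lookup is only meant to illustrate how the distance would be used, not how the notebook's k-NN classifier is built.

```python
def minhash_distance(a, b):
    """One minus the fraction of positions at which a and b agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 1.0 - matches / len(a)


def nearest_neighbors(query, fingerprints, k=3):
    """Indices of the k fingerprints closest to the query, nearest first."""
    ranked = sorted(
        range(len(fingerprints)),
        key=lambda i: minhash_distance(query, fingerprints[i]),
    )
    return ranked[:k]
```

scikit-learn's KNeighborsClassifier accepts a callable as its metric parameter, so a function with this two-argument shape could in principle be passed there, operating on array rows instead of lists.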

3. The Natural Product Atlas MAP4 TMAP

Using the indices generated by the MinHashing procedure of the MAP4 calculation, an LSH forest was generated and used to lay out the TMAP. The resulting TMAP can be found here.