GlycosylationStatistics

Systematic detection of sugar moieties in COCONUT using the Sugar Removal Utility

Description and Analysis of Glycosidic Residues in the Largest Open Natural Products Database

Code for automated, systematic detection of sugar moieties in the COlleCtion of Open Natural prodUcTs (COCONUT) database

Description

This repository contains Java source code for the automated in silico detection and analysis of glycosidic moieties in the largest open natural products database, COCONUT, using the Sugar Removal Utility, as described in Schaub, J.; Zielesny, A.; Steinbeck, C.; Sorokina, M. Description and Analysis of Glycosidic Residues in the Largest Open Natural Products Database. Biomolecules 2021, 11, 486.
Additionally, similar analyses are performed on datasets from the ZINC15 database, DrugBank, and ChEMBL.
Python scripts and Jupyter notebooks for curating some of the used datasets and for analysing and visualising the results are also supplied in this repository.

Please NOTE that the code in this repository is primarily meant to show how the glycosylation statistics published in the article linked above were generated and to allow the reproduction and execution of the same analyses for other datasets. It is not considered a standalone software product. Hence, conveniences such as the publication of a Maven artifact for straightforward installation are not provided here.

The Sugar Removal Utility (SRU) itself, however, can be installed as a Maven artifact in a straightforward manner and used in your own scripts and workflows to analyse other datasets.
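For orientation, below is a minimal sketch of how the SRU could be used together with the CDK in such a custom workflow. It assumes the SRU and CDK are on the classpath; the constructor and method names follow the SRU documentation at the time of writing and should be verified against the installed version:

```java
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;

import de.unijena.cheminf.deglycosylation.SugarRemovalUtility;

public class SugarDetectionSketch {
    public static void main(String[] args) throws Exception {
        // Parse an example glycoside (salicin) from SMILES using the CDK.
        SmilesParser smilesParser = new SmilesParser(SilentChemObjectBuilder.getInstance());
        IAtomContainer molecule = smilesParser.parseSmiles("OCC1OC(OC2=CC=CC=C2CO)C(O)C(O)C1O");

        // Instantiate the Sugar Removal Utility with default settings.
        SugarRemovalUtility sugarRemovalUtility =
                new SugarRemovalUtility(SilentChemObjectBuilder.getInstance());

        // Detect circular and/or linear sugar moieties ...
        boolean hasSugars = sugarRemovalUtility.hasCircularOrLinearSugars(molecule);
        System.out.println("Contains sugar moieties: " + hasSugars);

        // ... and remove them to obtain the aglycone (second argument: clone the input).
        IAtomContainer aglycone =
                sugarRemovalUtility.removeCircularAndLinearSugars(molecule, true);
        System.out.println("Heavy atom count of the aglycone: " + aglycone.getAtomCount());
    }
}
```

Settings such as the handling of terminal vs. non-terminal moieties or the detection of linear sugars can be adjusted on the SugarRemovalUtility instance; see the SRU documentation linked below for details.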

Contents

Source code for glycosylation statistics analysis

The directory /src/test/java/de/unijena/cheminf/deglycosylation/stats/ contains the class GlycosylationStatisticsTest. It is a JUnit test class with multiple test methods that can be run in a script-like fashion to perform the various analyses. Using an IDE such as IntelliJ is recommended. Please note that some directory paths will need to be adjusted and some datasets placed into the /src/test/resources/ directory (see below) to run the tests yourself.
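To illustrate this script-like pattern (this is not an actual method from GlycosylationStatisticsTest), the following hedged sketch iterates over an SDF placed in /src/test/resources/ and counts the molecules containing sugar moieties; the class name, file name, and JUnit version used here are placeholder assumptions:

```java
import java.io.FileInputStream;

import org.junit.Test;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.io.iterator.IteratingSDFReader;
import org.openscience.cdk.silent.SilentChemObjectBuilder;

import de.unijena.cheminf.deglycosylation.SugarRemovalUtility;

public class GlycosylationStatisticsSketchTest {

    @Test
    public void countSugarContainingMoleculesInSdfTest() throws Exception {
        // Placeholder dataset name; place the file in /src/test/resources/ and adjust the path.
        String sdfPath = "src/test/resources/example_dataset.sdf";
        SugarRemovalUtility sugarRemovalUtility =
                new SugarRemovalUtility(SilentChemObjectBuilder.getInstance());
        int moleculeCount = 0;
        int sugarContainingCount = 0;
        // Iterate over the SDF entry by entry and test each molecule for sugar moieties.
        try (IteratingSDFReader reader = new IteratingSDFReader(
                new FileInputStream(sdfPath), SilentChemObjectBuilder.getInstance())) {
            while (reader.hasNext()) {
                IAtomContainer molecule = reader.next();
                moleculeCount++;
                if (sugarRemovalUtility.hasCircularOrLinearSugars(molecule)) {
                    sugarContainingCount++;
                }
            }
        }
        System.out.println(sugarContainingCount + " of " + moleculeCount
                + " molecules contain sugar moieties.");
    }
}
```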

The directory /Python_scripts_and_notebooks/ contains a Python script for picking a diverse subset of a larger dataset using the RDKit MaxMin algorithm. For the reported analyses, it was used to reduce the size of the downloaded ZINC "in-vitro" subset while preserving its diversity. Additionally, this directory contains two Jupyter Notebooks that were used to analyse and visualise some of the test results.

Installation

This is a Maven project. To perform the described analyses yourself, download or clone the repository, open it as a Maven project in an IDE with Maven support (e.g. IntelliJ), and execute the pom.xml file. Maven will then take care of installing all dependencies.
To run the COCONUT-analysing tests, a MongoDB instance with the COCONUT NP database imported needs to be running on your platform; see the "Required datasets" section below for details and the download link.
To run the Python scripts and Jupyter Notebooks, installing Anaconda is recommended, as it also eases the installation of the required libraries, such as the open-source cheminformatics toolkit RDKit.
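As a rough illustration of what adjusting the MongoDB connection in the code involves, the sketch below connects to a local, unauthenticated MongoDB instance with the official Java sync driver; the connection string, database name, and collection name are assumptions that must be checked against your setup and the imported COCONUT dump:

```java
import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class CoconutConnectionSketch {
    public static void main(String[] args) {
        // Adjust host, port, and credentials to match your MongoDB instance.
        try (MongoClient mongoClient = MongoClients.create("mongodb://localhost:27017")) {
            // Database and collection names are assumptions; check them against the imported dump.
            MongoCollection<Document> collection = mongoClient
                    .getDatabase("COCONUT")
                    .getCollection("uniqueNaturalProduct");
            System.out.println("Number of documents: " + collection.countDocuments());
            Document firstDocument = collection.find().first();
            if (firstDocument != null) {
                // Inspect the available fields, e.g. to find the SMILES field to parse with the CDK.
                System.out.println("Fields of the first document: " + firstDocument.keySet());
            }
        }
    }
}
```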

Required datasets

  • COCONUT: To run the COCONUT-analysing tests, a MongoDB instance needs to be running on your platform and the COCONUT NP database imported to it. The respective MongoDB dump can be downloaded at https://coconut.naturalproducts.net/download. Please check the credentials for the MongoDB connection in the code and adjust them if needed. One test method additionally analyses COCONUT in the form of an SDF; this file can also be obtained from the given webpage and needs to be placed in the /src/test/resources/ directory.
  • ZINC15: A list of the available ZINC15 subsets can be found on the ZINC15 website. It is recommended to use the program wget to download the subsets. All subsets were downloaded as SMILES files.
    • ZINC "for-sale": A part of the ZINC "for-sale" subset was downloaded for the published analyses and further reduced in size using the ZINC_for-sale_curation.py script located in the /Python_scripts_and_notebooks/ directory. One test method curates the dataset further. After this is done, the curated datasets needs to be placed in the /src/test/resources/ directory for it to be analysed by other test methods.
    • ZINC "in-vitro": One test method curates the dataset. After this is done, the curated datasets needs to be placed in the /src/test/resources/ directory for it to be analysed by other test methods.
    • ZINC "biogenic": The ZINC "biogenic" dataset needs to be placed in the /src/test/resources/ directory to be used for the curation of the other datasets.
  • Manually curated review of bacterial natural products sugar moieties: Two of the test methods perform a substructure search in COCONUT for sugar moieties reported in bacterial natural products, manually curated by Elshahawi et al. (a sketch of such a substructure check is given after this list). This dataset is already supplied in this repository in the /src/test/resources/ directory.
  • ChEMBL: The ChEMBL 28 database is curated in one test method and analysed for glycosidic moieties in another. To run the curation test, the dataset has to be placed in the /src/test/resources/ directory as an SDF. After curation, the curated dataset has to be placed in the same directory.
  • DrugBank: The DrugBank "all structures" dataset is curated in one test method and analysed for glycosidic moieties in another. To run the curation test, the dataset has to be placed in the /src/test/resources/ directory as an SDF. After curation, the curated dataset has to be placed in the same directory.
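For illustration only, a substructure check of the kind mentioned above can be sketched with the CDK as follows; the query and target structures are placeholders, and the matching options used for the published analysis (e.g. regarding aromaticity or stereochemistry) may differ:

```java
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.isomorphism.Pattern;
import org.openscience.cdk.isomorphism.VentoFoggia;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;

public class SugarSubstructureSearchSketch {
    public static void main(String[] args) throws Exception {
        SmilesParser smilesParser = new SmilesParser(SilentChemObjectBuilder.getInstance());
        // Placeholder query: a pyranose ring, standing in for one entry of the curated sugar list.
        IAtomContainer sugarQuery = smilesParser.parseSmiles("OCC1OC(O)C(O)C(O)C1O");
        // Placeholder target: salicin, a simple glucoside.
        IAtomContainer target = smilesParser.parseSmiles("OCC1OC(OC2=CC=CC=C2CO)C(O)C(O)C1O");

        // Build a substructure pattern with the CDK's VF2-based matcher and test the target.
        Pattern pattern = VentoFoggia.findSubstructure(sugarQuery);
        System.out.println("Sugar moiety found as substructure: " + pattern.matches(target));
    }
}
```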

Dependencies

References and useful links

  • Glycosylation statistics of COCONUT publication
  • Sugar Removal Utility
  • Chemistry Development Kit (CDK)
  • COlleCtion of Open NatUral producTs (COCONUT)
  • ZINC
  • MongoDB
  • RDKit
  • DrugBank
  • ChEMBL