
Synthetic Bayesian Classification

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

SYBA

SYnthetic BAyesian classifier (SYBA) is a Python package for classifying organic compounds as easy-to-synthesize (ES) or hard-to-synthesize (HS). SYBA is a fragment-based method: the molecule is decomposed into ECFP4-like fragments, a score is assigned to each fragment, and all fragment scores are summed to give the SYBA score. If the SYBA score is positive, the molecule is considered ES; otherwise it is considered HS. Fragment scores are part of the SYBA algorithm and were obtained by analyzing the frequency of fragments in databases of ES and HS compounds. ES compounds were randomly selected from the ZINC15 [http://zinc.docking.org/] database; HS compounds were generated by the Nonpher [https://github.com/lich-uct/nonpher] approach. More details can be found in the SYBA [as soon as accepted] and Nonpher [http://dx.doi.org/10.1186/s13321-017-0206-2] papers.
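
The summation and sign rule described above can be illustrated with a minimal sketch. It is not SYBA's implementation: the fragment names and score values are hypothetical placeholders, the helper names `total_score` and `classify` are invented for this example, and skipping fragments that have no known score is an assumption of the sketch.

```python
from typing import Dict, Iterable

# Hypothetical fragment scores, for illustration only. Real SYBA scores are
# derived from fragment frequencies in the ES and HS training databases.
EXAMPLE_FRAGMENT_SCORES: Dict[str, float] = {
    "frag_a": 1.5,
    "frag_b": -0.5,
    "frag_c": -2.5,
}


def total_score(fragments: Iterable[str], scores: Dict[str, float]) -> float:
    """Sum the scores of the fragments found in one molecule.

    Fragments without a known score are skipped (an assumption of this sketch).
    """
    return sum(scores[f] for f in fragments if f in scores)


def classify(score: float) -> str:
    """Apply the sign rule: a positive score means ES, anything else HS."""
    return "ES" if score > 0 else "HS"


score = total_score(["frag_a", "frag_b"], EXAMPLE_FRAGMENT_SCORES)
label = classify(score)
```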

Installation

Prerequisites

Supported platforms:

  • All platforms

Dependencies

Installation with Anaconda

SYBA is distributed as a Conda package. Conda is an open-source package and environment management system that simplifies setting up a development environment for any project. To install a Conda package, you need either the full Anaconda [https://www.anaconda.com/] distribution or its lightweight variant, Miniconda [https://docs.conda.io/en/latest/miniconda.html]. To install SYBA with Anaconda/Miniconda, run the following command from the Linux terminal:

conda install -c rdkit -c lich syba

Installation with setup.py

Once you have RDKit [https://github.com/rdkit/rdkit] installed, you can install SYBA from the SYBA source directory with the following command:

python setup.py install

Quick start

SYBA input is a CSV (comma-separated values) file with the following columns: CMPND_ID,SMILES,OTHER_COLUMNS. OTHER_COLUMNS can contain any additional data; these columns are skipped. The output is a CSV file in the format ID,SMILES,SYBA_SCORE. The SYBA score reflects how confident the classifier is in its prediction (i.e., it cannot be considered a measure of the ease of synthesis). A negative SYBA score indicates a hard-to-synthesize compound and a positive score an easy-to-synthesize one.

SYBA automatically installs a command line tool, syba. SYBA classification is performed with the following command:

$ syba [INPUT_FILE [OUTPUT_FILE]]
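
As a rough sketch of how the ID,SMILES,SYBA_SCORE output could be consumed, the following reads such rows with Python's standard csv module and labels each compound by the sign of its score. The function name `read_syba_output` is hypothetical, the sample rows and scores are invented for illustration, and whether the real output file starts with a header line is not stated here, so the sketch simply skips any row whose third column is not a number.

```python
import csv
import io
from typing import Iterable, List, Tuple


def read_syba_output(lines: Iterable[str]) -> List[Tuple[str, str, float, str]]:
    """Parse ID,SMILES,SYBA_SCORE rows and add an ES/HS label per compound."""
    results = []
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        cmpnd_id, smiles, raw_score = row[0], row[1], row[2]
        try:
            score = float(raw_score)
        except ValueError:
            continue  # non-numeric score column: assumed to be a header line
        label = "ES" if score > 0 else "HS"
        results.append((cmpnd_id, smiles, score, label))
    return results


# Invented sample data; the scores are not real SYBA output.
example = io.StringIO("ID,SMILES,SYBA_SCORE\ncmpd1,CCO,12.5\ncmpd2,C1CC1,-3.0\n")
rows = read_syba_output(example)
```

For a file on disk, the same function can be given an open file handle, for example `with open("output.csv", newline="") as fh: rows = read_syba_output(fh)`, where `output.csv` stands for whatever OUTPUT_FILE was passed to the command.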

Use in Python script

Basic usage

from rdkit import Chem
from syba.syba import SybaClassifier

syba = SybaClassifier()
syba.fitDefaultScore()
smi = "O=C(C)Oc1ccccc1C(=O)O"
syba.predict(smi)
# SYBA also works with RDKit Mol objects
mol = Chem.MolFromSmiles(smi)
syba.predict(mol=mol)
# syba.predict takes two keyword parameters, "smi" and "mol"; if both are provided, "smi" has priority and the score is calculated for the compound it defines
syba.predict(smi=smi, mol=mol)
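
To score several compounds at once, the single-compound call can be wrapped in a small helper. This is only a sketch: `score_many` and `split_by_sign` are hypothetical names, not part of the SYBA package, and the stand-in scoring function below exists solely so the example runs without SYBA installed.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def score_many(predict: Callable[[str], float],
               smiles_list: Iterable[str]) -> Dict[str, float]:
    """Apply a scoring function to each SMILES string and collect the scores."""
    return {smi: predict(smi) for smi in smiles_list}


def split_by_sign(scores: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Split compounds into ES (positive score) and HS (all other scores)."""
    es = [smi for smi, score in scores.items() if score > 0]
    hs = [smi for smi, score in scores.items() if score <= 0]
    return es, hs


def fake_predict(smi: str) -> float:
    """Stand-in scorer for demonstration only; it has no chemical meaning."""
    return float(len(smi) - 3)


scores = score_many(fake_predict, ["CCO", "CCCCO", "C"])
es, hs = split_by_sign(scores)

# With SYBA installed, the fitted classifier from the example above could be
# passed in instead of the stand-in:
# scores = score_many(syba.predict, ["O=C(C)Oc1ccccc1C(=O)O", "CCO"])
```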

SYBA workflow

SYBA training (i.e., calculation of the SYBA fragment scores) is demonstrated in the Jupyter notebook docs/notebooks/prepare_fragment_counts.ipynb. An example of classifying a new compound with SYBA, as well as with SAScore, SCScore, and a random forest, is available in the Jupyter notebook docs/notebooks/prepare_results.ipynb. Jupyter Notebook can be installed from Conda with the command conda install jupyter.