Pipeline for prediction of monomer glass transition temperature, and development of QSAR/QSPR models via mol2vec

tgBoost

tgBoost is a pipeline encompassing a QSPR model optimized for the prediction of the glass transition temperature (Tg) of monomer organic compounds. The pipeline is based on mol2vec, a machine learning (ML) algorithm that converts molecular SMILES into molecular embeddings. The pipeline can be expanded to include further QSAR/QSPR models developed from SMILES notation.

Motivation

tgBoost is a kickstart project aimed at expanding the use of ML, Data Engineering and QSAR/QSPR models in atmospheric and physical chemistry. The pipeline comes with a pretrained, ML-powered QSPR model that predicts the Tg of monomer organic compounds. The model is based on the Extreme Gradient Boosting framework (XGBoost) and was developed from the largest dataset of experimental Tg values of monomer organic molecules (Koop et al., 2011).

Requirements

Installation

pip install git+https://github.com/U0M0Z/tgpipe

The tgBoost library requires mol2vec to be installed separately via pip in the same working environment:

pip install git+https://github.com/samoturk/mol2vec
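
After both installs, it can help to confirm that the packages are visible in the working environment. The following is a minimal sketch using only the Python standard library. The helper name `missing_packages` is hypothetical and not part of tgBoost, and the import names `tgboost`, `mol2vec`, `rdkit` and `xgboost` are assumed from the package names mentioned in this README.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    # find_spec returns None when a top-level package is not installed.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names; adjust them if your installation differs.
    required = ["tgboost", "mol2vec", "rdkit", "xgboost"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

An empty result means every listed package can be located; it does not guarantee that the installed versions are compatible with each other.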

Documentation

Details on the statistical analysis performed to develop the model and pipeline are found in the supporting article.

Usage

Basic use

This code uses the tgPipeline to train tgBoost, a QSPR model for Tg prediction. The QSPR model is based on rdkit, mol2vec and xgboost. To use the model on your machine, you need to retrain it so that it conforms to the C++ signature of your processor.

The tgBoost model is built, trained, and saved in ./trained_models with the command:

python tgPipeline/tgboost/train_pipeline.py

Check for the following message to confirm successful model training:

*** EXTRACTION step
n_input SMILES:  415 

*** TRANSFORMING step
n_output SMILES:  298 

~~ DATA info
Xtrain:  298 ytrain:  298 Xtest:  0 ytest:  0 

*** REGRESSION step

PIPELINE completed:
_ ~ ^ ~ _ ~ _ ~ ^ ~ _ ~ _ ~ ^ ~ _ ~ ^ ~ _ ~ ^ ~ _
  __       ___                __ 
 / /____ _/ _ )___  ___  ___ / /_
/ __/ _ `/ _  / _ \/ _ \(_-</ __/
\__/\_, /____/\___/\___/___/\__/ 
   /___/                         
_ ~ ^ ~ _ ~ _ ~ ^ ~ _ ~ _ ~ ^ ~ _ ~ ^ ~ _ ~ ^ ~ _
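
If the training run is scripted, the same check can be automated by parsing the captured output. The sketch below is not part of tgBoost: the function name `summarize_training_log` is hypothetical, and it assumes the output keeps the labels shown above (`n_input SMILES`, `n_output SMILES`, `Xtrain`, `PIPELINE completed`).

```python
import re


def summarize_training_log(log_text):
    """Extract SMILES counts and the completion flag from the training output."""

    def count(label):
        # Match lines such as "n_input SMILES:  415" and return the integer.
        match = re.search(rf"{re.escape(label)}\s*:\s*(\d+)", log_text)
        return int(match.group(1)) if match else None

    return {
        "n_input": count("n_input SMILES"),
        "n_output": count("n_output SMILES"),
        "n_train": count("Xtrain"),
        "completed": "PIPELINE completed" in log_text,
    }


# Excerpt of the output shown above, used here as sample input.
sample_log = """*** EXTRACTION step
n_input SMILES:  415

*** TRANSFORMING step
n_output SMILES:  298

~~ DATA info
Xtrain:  298 ytrain:  298 Xtest:  0 ytest:  0

*** REGRESSION step

PIPELINE completed:
"""

summary = summarize_training_log(sample_log)
print(summary)
```

In the example run above, 415 SMILES are read in the extraction step and 298 remain after the transforming step, so comparing `n_input` with `n_output` shows how many entries were dropped before regression.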

As python module

import tgboost.processing.smiles_manager as sm
from tgboost import predict

The first line imports functions to open and preprocess files containing SMILES used for predictions, and the second line imports functions for predicting Tg of SMILES.

Check the notebooks in the repository for examples and details.

How to cite?

✨ 🍰 ✨

@Article{D1EA00090J,
author ="Galeazzo, Tommaso and Shiraiwa, Manabu",
title  ="Predicting glass transition temperature and melting point of organic compounds via machine learning and molecular embeddings",
journal  ="Environ. Sci.: Atmos.",
year  ="2022",
volume  ="2",
issue  ="3",
pages  ="362-374",
publisher  ="RSC",
doi  ="10.1039/D1EA00090J",
url  ="http://dx.doi.org/10.1039/D1EA00090J"
}

Contribute

Contact at tommaso.galeazzo@gmail.com

Credits

Initial development was supported by AirUCI, Irvine, CA.

License

BSD 3-clause © Tommaso Galeazzo