The project uses ML models and ensembles to predict molecular binding from molecular fingerprints and protein features. It evaluates model performance, calibrates the predicted probabilities, and aims to optimise prediction accuracy for interactions between chemical compounds and proteins. Tools: DuckDB, RDKit, XGBoost, CatBoost, LightGBM, Ensemble Learning, Calibration

Primary language: Jupyter Notebook. License: Apache License 2.0 (Apache-2.0).

Ensemble Learning for Molecular Binding Prediction

The goal is to develop machine learning (ML) models that can accurately predict the binding affinity of small molecules to specific protein targets. This task is crucial in drug development because it helps identify potential drug candidates more efficiently. Traditional methods of identifying such candidates involve physically testing each molecule's interaction with the protein target, which is laborious and time-consuming.

The code implements an end-to-end pipeline for molecular binding prediction. It involves:

  • Data processing using Pandas and DuckDB.

  • Molecular fingerprint generation using RDKit.

  • Model training with various classifiers including Random Forest, XGBoost, LightGBM, CatBoost, SVM, and MLP.

  • Ensemble techniques such as Stacking, Blending, and Bagging.

  • Evaluation of individual models and ensembles using Average Precision Score.

  • Test set predictions, calibration, and ensemble predictions.
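
To make the evaluation step concrete, here is a minimal, dependency-free sketch of how an average precision score can be computed and used to compare individual models with an ensemble. It is not the notebook's code. The function names (`average_precision`, `mean_ensemble`), the toy labels, and the toy scores are hypothetical and exist only for illustration. The ensemble is a plain unweighted average of per-model probabilities, which is a simpler stand-in for the stacking and blending listed above. The sketch also assumes there are no tied scores; scikit-learn's `average_precision_score` groups tied scores by threshold, so results can differ when ties occur.

```python
from typing import List, Sequence


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Mean of the precision values measured at the rank of each positive example.

    Assumes binary labels (0/1) and no tied scores.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive labels")
    # Rank examples from the highest to the lowest predicted score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


def mean_ensemble(model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Unweighted average of per-example probabilities across models."""
    n_models = len(model_probs)
    return [sum(column) / n_models for column in zip(*model_probs)]


# Hypothetical toy data: 1 = binds, 0 = does not bind.
y_true = [1, 0, 1, 0, 0, 1]
model_a = [0.9, 0.8, 0.4, 0.3, 0.2, 0.7]   # made-up probabilities from one model
model_b = [0.6, 0.2, 0.9, 0.5, 0.1, 0.35]  # made-up probabilities from another

ap_a = average_precision(y_true, model_a)
ap_b = average_precision(y_true, model_b)
ap_ensemble = average_precision(y_true, mean_ensemble([model_a, model_b]))

print(f"model A: {ap_a:.3f}, model B: {ap_b:.3f}, ensemble: {ap_ensemble:.3f}")
```

On this toy data the averaged scores rank all three binders first, so the ensemble scores higher than either model alone. That is a property of the chosen numbers, not a guarantee; on real data each ensemble should be evaluated on a held-out set, as the pipeline does.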