EMCIP: An Ensemble Model for Cdr1 Inhibitor Prediction

This repository introduces EMCIP, an Ensemble Model for Cdr1 Inhibitor Prediction, described in our paper:

Trinh, T.-C., Falson, P., Tran-Nguyen, V.-K.* & Boumendjel, A.* Ligand-Based Drug Discovery Leveraging Traditional Machine Learning and Deep Learning Methodologies Exemplified by Prediction of Cdr1 Inhibitors. (2024)

Installation
EMCIP GUI
Additional Information
Contributors
Contact
Acknowledgments

Installation

To set up the environment for EMCIP, you will need Conda (v.24.1.2). For Conda installation instructions, refer to this link.

Run the following commands in your terminal to install the GUI for EMCIP:

git clone https://github.com/trinhthechuong/Cdr1_inhibitors.git
cd Cdr1_inhibitors
conda env create --file environment.yml
conda activate EMCIP_env
pip install -r requirements.txt
streamlit run EMCIP.py

EMCIP GUI for AI Non-Expert Users

EMCIP provides two functionalities: Batch Prediction and Molecule Prediction.

Batch Prediction

In the main menu, select the Predict a batch option and follow these steps:

Upload a *.csv file. The file must contain two columns: the first for molecule names or IDs, and the second for the SMILES of these molecules. Then, provide names for your columns and choose the number of processors for the calculation.
Click the Featurize button and wait for the process to complete. During this step, your molecules will be standardized and converted into various molecular representations (RDK5, RDK6, RDK7, Avalon, Mordred, Gobbi Pharmacophore) and 3D molecular graphs.
Once featurization is complete, click the Prediction button to predict your data.

All featurized datasets and prediction results are saved in the Cdr1_classification folder. Click the Restart button to start predicting another file.
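The input file described in the steps above can be prepared with a few lines of Python. This is a minimal sketch, not part of the EMCIP code base: the column names `ID` and `SMILES`, the file name `my_molecules.csv`, and the helper functions are all assumptions chosen for illustration, and the three molecules are placeholder examples. It also assumes the file carries a header row, since the GUI asks you to name your columns.

```python
import csv
from pathlib import Path

# Hypothetical column names; in the GUI you supply whichever names your file uses.
ID_COLUMN = "ID"
SMILES_COLUMN = "SMILES"

# Placeholder molecules for illustration only.
molecules = [
    ("ethanol", "CCO"),
    ("benzene", "c1ccccc1"),
    ("aspirin", "CC(=O)Oc1ccccc1C(=O)O"),
]


def write_batch_csv(path, rows):
    """Write (identifier, SMILES) pairs as a two-column CSV with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([ID_COLUMN, SMILES_COLUMN])
        writer.writerows(rows)


def check_batch_csv(path):
    """Return the number of data rows; raise ValueError if a row is malformed."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        if len(header) != 2:
            raise ValueError(f"expected 2 columns, found {len(header)}")
        count = 0
        for line_number, row in enumerate(reader, start=2):
            # Every row needs a non-empty identifier and a non-empty SMILES string.
            if len(row) != 2 or not row[0].strip() or not row[1].strip():
                raise ValueError(f"malformed row at line {line_number}: {row}")
            count += 1
    return count


if __name__ == "__main__":
    output_path = Path("my_molecules.csv")  # hypothetical file name
    write_batch_csv(output_path, molecules)
    print(f"{check_batch_csv(output_path)} molecules written to {output_path}")
```

The check only verifies the two-column layout. It does not validate the SMILES strings themselves; that happens when EMCIP standardizes the molecules during featurization.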

Molecule Prediction

In the main menu, select the Predict a molecule option and follow these steps:

Enter the SMILES string of your molecule.
Click the Predict button and wait for the process to complete. During this step, your molecule is standardized and converted into various molecular representations (RDK5, RDK6, RDK7, Avalon, Mordred, Gobbi Pharmacophore) and 3D molecular graphs, and is then evaluated by the EMCIP model.
The output is the predicted probability of your molecule being a Cdr1 inhibitor. Additionally, you can interact with the generated conformations used as input for MIL-3D-GNN.

Hugging Face Version

The EMCIP model is also available for direct prediction on the Hugging Face platform (EMCIP-Hugging Face). However, for better performance, we recommend installing EMCIP locally so that the calculations run on your own processors.

You can find the instructional video on how to use our model here.

Additional Information

The dataset folder stores all training data and corresponding results.

Datasets

original_dataset.csv: This file contains all assembled molecules for EMCIP along with their references.
Featurized_data folder:
- BM_stratified_sampling: This sub-folder stores all datasets used for training (training set) and validation (external test set and hard test set).
- MIL_3D_GNN: This sub-folder stores graph datasets specifically used for the MIL-3D-GNN model.

Molecular Representation Meta-Analysis

The molecular_representation_analysis sub-folder contains all 16 ligand-based structural representation datasets and the results of the associated meta-analysis, including the Wilcoxon signed-rank test.
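For readers unfamiliar with the Wilcoxon signed-rank test, the sketch below shows how its rank sums are computed for two sets of paired scores, for example two molecular representations evaluated on the same cross-validation folds. It is an illustration only, not the analysis code used in this repository: the function name is hypothetical and the scores are made-up placeholder numbers, not results from the paper.

```python
def wilcoxon_signed_rank_statistic(scores_a, scores_b):
    """Return (W_plus, W_minus), the rank sums of positive and negative differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("the two score lists must be paired (same length)")

    # Zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1

    w_plus = sum(rank for rank, diff in zip(ranks, diffs) if diff > 0)
    w_minus = sum(rank for rank, diff in zip(ranks, diffs) if diff < 0)
    return w_plus, w_minus


if __name__ == "__main__":
    # Made-up per-fold scores (in percent) for two hypothetical representations.
    representation_a = [81, 79, 84, 80, 83]
    representation_b = [78, 80, 80, 75, 82]
    print(wilcoxon_signed_rank_statistic(representation_a, representation_b))
```

The smaller of the two rank sums is the usual test statistic. In practice, a library routine such as `scipy.stats.wilcoxon` also reports the corresponding p-value.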

Traditional Machine Learning Model Selection

The ml_model_selection sub-folder stores all validation results for the traditional machine learning models, including Bemis-Murcko scaffold 5-fold cross-validation and external test set validation.
The bayesian_estimation sub-folder holds the results of comparing machine learning model performance through Bayesian estimation.
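The idea behind scaffold-based cross-validation is that molecules sharing a Bemis-Murcko scaffold are never split between the training and test folds. The sketch below illustrates one simple way to build such folds; it is an assumption-laden example and not necessarily the procedure used in this repository. The function names are hypothetical, the scaffold strings are placeholders, and the scaffolds are assumed to have been computed beforehand (for instance with RDKit's `MurckoScaffold` module).

```python
from collections import defaultdict


def scaffold_k_fold(scaffolds, n_splits=5):
    """Assign molecule indices to folds so that each scaffold stays in one fold."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    folds = [[] for _ in range(n_splits)]
    # Place the largest scaffold groups first, each into the currently smallest fold.
    for scaffold in sorted(groups, key=lambda s: (-len(groups[s]), s)):
        smallest = min(range(n_splits), key=lambda f: len(folds[f]))
        folds[smallest].extend(groups[scaffold])
    return folds


def iter_train_test(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for held_out, test_indices in enumerate(folds):
        train_indices = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield sorted(train_indices), sorted(test_indices)


if __name__ == "__main__":
    # Placeholder scaffold SMILES, one entry per molecule (illustration only).
    scaffolds = [
        "c1ccccc1", "c1ccccc1", "c1ccncc1", "c1ccc2ccccc2c1",
        "c1ccncc1", "C1CCCCC1", "c1ccoc1", "c1ccsc1",
        "c1ccccc1", "C1CCNCC1",
    ]
    folds = scaffold_k_fold(scaffolds, n_splits=5)
    for train, test in iter_train_test(folds):
        print(f"train={train} test={test}")
```

Grouping by scaffold makes each held-out fold structurally distinct from its training data, which gives a more demanding estimate of performance than a purely random split.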

MIL-3D-GNN

Validation results for MIL-3D-GNN on the validation, external, and hard test sets are stored in the validation_mil_3d_gnn sub-folder.
To view the hyperparameter tuning process for MIL-3D-GNN in MLflow, run the following commands in your terminal:

cd MIL_3D_GNN
mlflow server --host 127.0.0.1 --port 8080

graph_featurization.ipynb: This Jupyter Notebook details the process of converting molecules into graph representations for use with the MIL-3D-GNN model.

Contributors

Contact

For further queries, please contact:

The-Chuong Trinh: the-chuong.trinh@etu.univ-grenoble-alpes.fr, thechuong123@gmail.com
Dr. Viet-Khoa Tran-Nguyen: viet-khoa.tran-nguyen@u-paris.fr, khoatnv1993@gmail.com
Pr. Ahcène Boumendjel: ahcene.boumendjel@univ-grenoble-alpes.fr
