This is Ersilia's Week 2 task for Outreachy Summer 2024 contributors. This repository is a model validation for model eos2ta5.
hERG channel blockade is a recurring problem for small molecules during drug development, and its side effect is an increased risk of cardiotoxicity. This model addresses the problem by classifying drug-like molecules as hERG blockers or hERG non-blockers: a molecule is considered a hERG blocker if the predicted probability is >= 0.5 and a non-blocker if the probability is < 0.5.
- EOS model ID: eos2ta5
- Slug: cardiotoxnet-herg
- Task: Classification
- Output: Probability
- Output Type: Float
- Interpretation: Probability that the compound inhibits hERG (IC50 < 10 uM)
This model predicts ligand-based hERG blockade, which is of utmost importance in drug discovery. It is a classification model that returns the probability that a compound inhibits hERG, with activity defined by the authors as IC50 < 10 uM. The eos2ta5 model takes a single compound as input and returns a single probability value (a float) as output.
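The interpretation rule above (blocker if the probability is at least 0.5) can be sketched as a small helper. This is a hypothetical illustration, not part of the Ersilia API:

```python
def classify_herg(probability, threshold=0.5):
    """Label a compound from the model's output probability."""
    return "hERG blocker" if probability >= threshold else "hERG non-blocker"

print(classify_herg(0.73))  # a probability >= 0.5 is labelled a blocker
```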
Tested on Ubuntu with Python versions >=3.7 and <=3.11.
- Install Ersilia. The link can be found here
- Fetch the model
ersilia -v fetch eos2ta5
- Serve the model
ersilia -v serve eos2ta5
- Run Predictions using a dataset that has SMILES as a column
ersilia -v api run -i input.csv -o output.csv
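The input file for the run command just needs a SMILES column. A minimal sketch that writes one is shown below; the column name and example molecules are illustrative:

```python
import csv

# Write a minimal input file with a SMILES column (illustrative molecules)
rows = [{"smiles": "CCO"}, {"smiles": "c1ccccc1"}]  # ethanol, benzene
with open("input.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["smiles"])
    writer.writeheader()
    writer.writerows(rows)
```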
The process can be found here
The aim was to download, fetch, serve, and run a prediction to test the model on Ersilia and confirm that it works. The process was done using Ersilia on Google Colab. I used a random dataset with 9 records containing a SMILES column to test model eos2ta5. The output can be found in the notebook folder. Link to the model-testing notebook: eos2ta5_model
- Summary: The output returns the following columns: keys, input, and probability. Hence, model eos2ta5 works perfectly well.
This task aims to select a list of 1000 molecules from public repositories and ensure they are represented as standard SMILES. I acquired the dataset from the PubChem database. The downloaded dataset contains about 2265 rows with numerous fields; its SMILES column is titled "canonicalsmiles". Since the SMILES were in canonical format, I decided to convert them to standardized SMILES, which is useful for running predictions. I then cleaned the dataset, filtered out unnecessary columns, and selected 1000 random records, which can be found in this path
- Data Cleaning - code containing the cleaning process in Python
- 1000molecules.csv - contains a list of random 1000 molecules from the dataset downloaded from the PubChem database.
- Summary: The cleaned file has 1000 molecules and 3 fields, namely: canonical SMILES, InChIKey, and molecular weight.
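The cleaning steps above can be sketched roughly as follows. The column names mirror the PubChem download, but the function name, seed, and exact field names are assumptions for illustration:

```python
import random

def select_molecules(rows, n=1000, seed=42):
    """Keep only the fields of interest and draw n random records."""
    cleaned = [
        {"canonicalsmiles": r["canonicalsmiles"],
         "inchikey": r["inchikey"],
         "mw": r["mw"]}
        for r in rows
        if r.get("canonicalsmiles")  # drop rows without a SMILES string
    ]
    random.seed(seed)                # fixed seed for a reproducible sample
    return random.sample(cleaned, min(n, len(cleaned)))
```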
The aim is to obtain a prediction on the 1000 molecules obtained from a public repository and evaluate the result using a scatter plot.
- 1000molecules_prediction.csv - output of the predicted value using the 1000molecule data.
To evaluate the model's results, I set a threshold probability of 0.5 and classified molecules as hERG blockers or hERG non-blockers accordingly. The results are visualized in a scatter plot and a bar chart. All the plots can be found in this link
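Aggregating the predicted probabilities into the two classes for the bar chart can be sketched as below (a hypothetical helper, not the plotting code itself):

```python
def summarize_predictions(probabilities, threshold=0.5):
    """Count hERG blockers vs non-blockers at the given threshold."""
    blockers = sum(p >= threshold for p in probabilities)
    return {"hERG blocker": blockers,
            "hERG non-blocker": len(probabilities) - blockers}
```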
The aim is to reproduce the results obtained from the publication paper
TOOL USED: Ubuntu. I took the following steps:
- I already had conda installed in my system
- Set up cardiotox in a conda environment
# create a conda environment
conda create -n cardiotox python=3.7.7
# activate the environment
conda activate cardiotox
- Installed PyBioMed and returned to the home directory
cd cardiotox
cd PyBioMed
python setup.py install
cd ..
- Installed the exact package version the author used
pip install tensorflow==2.3.1
pip install sklearn==0.0
pip install mordred==1.2.0
pip install pybel==0.14.10
pip install keras==2.4.3
- To test the model, I ran this code
python test.py
The output obtained matched the exact result from the author's publication. When implementing the author's source code, I advise doing it on Ubuntu. Using Jupyter Notebook or Google Colab resulted in errors because the package versions used by the author are outdated (the publication is over 3 years old), and using pip to install the exact package versions returns an error saying that no matching version can be found.
The data used to test the reproducibility of CardioTox was downloaded from the GitHub page
- external_test_pos.csv - data downloaded from the publication's GitHub page
- external_test_neg.csv - data downloaded from the publication's GitHub page
- Summary: The two datasets contain 44 records and two columns, namely: ACTIVITY and smiles
The aim was to run a prediction on the ersilia model eos2ta5. I took the following steps to achieve the reproducibility of output data.
- Fetch the model eos2ta5 from docker using:
docker pull ersiliaos/eos2ta5:latest
- Serve model eos2ta5 using:
ersilia -v serve eos2ta5
- Ran prediction
ersilia -v api run -i external_test_pos.csv -o reproducibility_prediction_output.csv
ersilia -v api run -i external_test_neg.csv -o test2_reproducibility_prediction_output.csv
- reproducibility_prediction_output - output data obtained after running predictions with Ersilia model eos2ta5
- test2_reproducibility_prediction_output - output data obtained after running predictions with Ersilia model eos2ta5
- Summary: These two output files returned a dataset containing 44 records and three columns namely: key, input, and probability.
The tool used is Jupyter Notebook and the code can be found here
I used the same evaluation criteria as in the publication paper to compare the results. Both runs produced similar results under these criteria; hence, the model is reproducible.
# WEEK 3
The experimental dataset used was obtained from a publication and the link to the data is found here.
In this task, I ensured that InChIKeys present in the experimental dataset (used for performance evaluation) are not included in the training dataset used to build the predictive model. Checking for data leakage showed that 7740 molecules from the external dataset were present in the training dataset, and I dropped those leaked records. For accuracy and model performance, it is advisable to always remove leaked data to avoid model bias and to keep the evaluation dataset independent from the training dataset. The source of the leakage was the public repositories from which the datasets were obtained: both the experimental dataset and the training dataset used by the authors came from ChEMBL and PubChem.
Dataset | Molecules |
---|---|
Training Data | 12620 |
Validation Data | 870 |
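The leakage check described above can be sketched with a set lookup on InChIKeys. The function and field names are hypothetical:

```python
def remove_leaked(external, training_inchikeys):
    """Drop external molecules whose InChIKey appears in the training set."""
    training = set(training_inchikeys)            # O(1) membership tests
    kept = [m for m in external if m["inchikey"] not in training]
    leaked = len(external) - len(kept)            # count of dropped records
    return kept, leaked
```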
The aim was to run a prediction on the Ersilia model eos2ta5 using the dataset that is to be used to validate the model. This was done on Google Colab
The model falls under the classification type, so I used several evaluation metrics commonly used for classification models: MCC, NPV, PPV, ACC, SEN, SPE, B-ACC, and the AUROC curve. SUMMARY:
Data | Model | MCC | NPV | ACC | PPV | SPE | SEN | B-ACC | AUC SCORE |
---|---|---|---|---|---|---|---|---|---|
Validation Dataset | eos2ta5 | 0.326 | 0.573 | 0.661 | 0.748 | 0.693 | 0.639 | 0.666 | 0.7 |
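The confusion-matrix metrics in the table above can be computed from binary labels as sketched below. This is an illustrative stdlib-only implementation, not the code actually used; AUROC is omitted here because it needs the raw probabilities rather than thresholded labels:

```python
import math

def classification_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics from binary (0/1) labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    sen = tp / (tp + fn)                 # sensitivity (recall)
    spe = tn / (tn + fp)                 # specificity
    ppv = tp / (tp + fp)                 # positive predictive value
    npv = tn / (tn + fn)                 # negative predictive value
    acc = (tp + tn) / len(y_true)
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / den if den else 0.0
    return {"MCC": mcc, "NPV": npv, "PPV": ppv, "ACC": acc,
            "SEN": sen, "SPE": spe, "B-ACC": (sen + spe) / 2}
```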