This is Ersilia's Week 2 task for Outreachy Summer 2024 contributors. This repository is a model validation for model eos2ta5.
hERG channel blockade is a recurring problem for small molecules during drug development, and its side effect is an increased risk of cardiotoxicity. This model addresses the problem by classifying drug-like molecules as hERG blockers or hERG non-blockers: a molecule is considered a hERG blocker if the predicted probability is >= 0.5 and a non-blocker if the probability is < 0.5.
- EOS model ID: eos2ta5
- Slug: cardiotoxnet-herg
- Task: Classification
- Output: Probability
- Output Type: Float
- Interpretation: Probability that the compound inhibits hERG (IC50 < 10 uM)
This model predicts ligand-based hERG blockade, which is of utmost importance in drug discovery. It is a classification model that returns the probability that a compound inhibits hERG, with activity defined by the authors as IC50 < 10 uM. The eos2ta5 model takes a single compound as input and returns a single probability value (a float) as output.
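The interpretation rule above (blocker if the probability is at least 0.5) can be sketched as a small helper. This is a hypothetical illustration, not part of the Ersilia API:

```python
def classify_herg(probability, threshold=0.5):
    """Label a compound from the model's output probability."""
    return "hERG blocker" if probability >= threshold else "hERG non-blocker"

print(classify_herg(0.73))  # a probability >= 0.5 is labelled a blocker
```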
Tested on Ubuntu with Python versions >=3.7 and <=3.11.
- Install Ersilia. The link can be found here
- Fetch the model
ersilia -v fetch eos2ta5
- Serve the model
ersilia -v serve eos2ta5
- Run Predictions using a dataset that has SMILES as a column
ersilia -v api run -i input.csv -o output.csv
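The input file for the run command just needs a SMILES column. A minimal sketch that writes one is shown below; the column name and example molecules are illustrative:

```python
import csv

# Write a minimal input file with a SMILES column (illustrative molecules)
rows = [{"smiles": "CCO"}, {"smiles": "c1ccccc1"}]  # ethanol, benzene
with open("input.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["smiles"])
    writer.writeheader()
    writer.writerows(rows)
```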
The process can be found here
The aim was to download, fetch, serve, and run a prediction to test the model on Ersilia and confirm that it works. The process was done using Ersilia on Google Colab. I used a random dataset with 9 records containing a SMILES column to test model eos2ta5. The output can be found in the notebook folder. Link to the model-testing notebook: eos2ta5_model
- Summary: The output returns the following columns: keys, input, and probability. Hence, model eos2ta5 works perfectly well.
This task aims to select a list of 1000 molecules from public repositories and ensure they are represented as standard SMILES. I acquired the dataset from the PubChem database. The downloaded dataset contains about 2265 rows with numerous fields; its SMILES column is titled "canonicalsmiles". Since the SMILES were in canonical format, I decided to convert them to standardized SMILES, which is useful for running predictions. I then cleaned the dataset, filtered out unnecessary columns, and selected 1000 random records, which can be found in this path
- Data Cleaning - code containing the cleaning process in Python
- 1000molecules.csv - contains a list of random 1000 molecules from the dataset downloaded from the PubChem database.
- Summary: The cleaned file has 1000 molecules and 3 fields, namely: canonical SMILES, InChIKey, and molecular weight.
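The cleaning steps above can be sketched roughly as follows. The column names mirror the PubChem download, but the function name, seed, and exact field names are assumptions for illustration:

```python
import random

def select_molecules(rows, n=1000, seed=42):
    """Keep only the fields of interest and draw n random records."""
    cleaned = [
        {"canonicalsmiles": r["canonicalsmiles"],
         "inchikey": r["inchikey"],
         "mw": r["mw"]}
        for r in rows
        if r.get("canonicalsmiles")  # drop rows without a SMILES string
    ]
    random.seed(seed)                # fixed seed for a reproducible sample
    return random.sample(cleaned, min(n, len(cleaned)))
```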
The aim is to obtain a prediction on the 1000 molecules obtained from a public repository and evaluate the result using a scatter plot.
- 1000molecules_prediction.csv - output of the predicted value using the 1000molecule data.
To evaluate the model's results, I set a threshold probability of 0.5 and classified molecules as hERG blockers or hERG non-blockers accordingly. The results are visualized in a scatter plot and a bar chart. All the plots can be found in this link
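Aggregating the predicted probabilities into the two classes for the bar chart can be sketched as below (a hypothetical helper, not the plotting code itself):

```python
def summarize_predictions(probabilities, threshold=0.5):
    """Count hERG blockers vs non-blockers at the given threshold."""
    blockers = sum(p >= threshold for p in probabilities)
    return {"hERG blocker": blockers,
            "hERG non-blocker": len(probabilities) - blockers}
```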
The aim is to reproduce the results obtained from the publication paper
TOOL USED: Ubuntu. I took the following steps:
- I already had conda installed in my system
- Set up cardiotox in a conda environment
# create a conda environment
conda create -n cardiotox python=3.7.7
# activate the environment
conda activate cardiotox
- Installed PyBioMed and returned to the home directory
cd cardiotox
cd PyBioMed
python setup.py install
cd ..
- Installed the exact package version the author used
pip install tensorflow==2.3.1
pip install sklearn==0.0
pip install mordred==1.2.0
pip install pybel==0.14.10
pip install keras==2.4.3
- To test the model, I ran this code
python test.py
The output obtained matched the exact result from the author's publication. When implementing the author's source code, I advise doing it on Ubuntu. Using Jupyter Notebook or Google Colab resulted in errors because the package versions used by the author are outdated (the publication is over 3 years old), and using pip to install the exact package versions returns an error saying that no matching version can be found.
The data used to test the reproducibility of CardioTox was downloaded from the GitHub page
- external_test_pos.csv - data downloaded from the publication's GitHub page
- external_test_neg.csv - data downloaded from the publication's GitHub page
- Summary: The two datasets contain 44 records and two columns, namely: ACTIVITY and smiles
The aim was to run a prediction on the ersilia model eos2ta5. I took the following steps to achieve the reproducibility of output data.
- Fetch the model eos2ta5 from docker using:
docker pull ersiliaos/eos2ta5:latest
- Serve model eos2ta5 using:
ersilia -v serve eos2ta5
- Ran prediction
ersilia -v api run -i external_test_pos.csv -o reproducibility_prediction_output.csv
ersilia -v api run -i external_test_neg.csv -o test2_reproducibility_prediction_output.csv
- reproducibility_prediction_output - output data obtained after running predictions with Ersilia model eos2ta5
- test2_reproducibility_prediction_output - output data obtained after running predictions with Ersilia model eos2ta5
- Summary: These two output files returned a dataset containing 44 records and three columns namely: key, input, and probability.
The tool used is Jupyter Notebook and the code can be found here
I used the same evaluation criteria as in the publication paper to compare the results. Both runs produced similar results under these criteria; hence, the model is reproducible.
# WEEK 3
The experimental dataset used was obtained from a publication and the link to the data is found here.
In this task, I ensured that InChIKeys present in the experimental dataset (used for performance evaluation) are not included in the training dataset used to build the predictive model. Checking for data leakage showed that 7740 molecules from the external dataset were present in the training dataset, and I dropped those leaked records. For accuracy and model performance, it is advisable to always remove leaked data to avoid model bias and to keep the evaluation dataset independent from the training dataset. The source of the leakage was the public repositories from which the datasets were obtained: both the experimental dataset and the training dataset used by the authors came from ChEMBL and PubChem.
Dataset | Molecules |
---|---|
Training Data | 12620 |
Validation Data | 870 |
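The leakage check described above can be sketched with a set lookup on InChIKeys. The function and field names are hypothetical:

```python
def remove_leaked(external, training_inchikeys):
    """Drop external molecules whose InChIKey appears in the training set."""
    training = set(training_inchikeys)            # O(1) membership tests
    kept = [m for m in external if m["inchikey"] not in training]
    leaked = len(external) - len(kept)            # count of dropped records
    return kept, leaked
```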
The aim was to run a prediction on the Ersilia model eos2ta5 using the dataset that is to be used to validate the model. This was done on Google Colab
The model falls under the classification type, so I used several evaluation metrics commonly used for classification models: MCC, NPV, PPV, ACC, SEN, SPE, B-ACC, and the AUROC curve. SUMMARY:
Data | Model | MCC | NPV | ACC | PPV | SPE | SEN | B-ACC | AUC SCORE |
---|---|---|---|---|---|---|---|---|---|
Validation Dataset | eos2ta5 | 0.326 | 0.573 | 0.661 | 0.748 | 0.693 | 0.639 | 0.666 | 0.7 |
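The confusion-matrix metrics in the table above can be computed from binary labels as sketched below. This is an illustrative stdlib-only implementation, not the code actually used; AUROC is omitted here because it needs the raw probabilities rather than thresholded labels:

```python
import math

def classification_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics from binary (0/1) labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    sen = tp / (tp + fn)                 # sensitivity (recall)
    spe = tn / (tn + fp)                 # specificity
    ppv = tp / (tp + fp)                 # positive predictive value
    npv = tn / (tn + fn)                 # negative predictive value
    acc = (tp + tn) / len(y_true)
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / den if den else 0.0
    return {"MCC": mcc, "NPV": npv, "PPV": ppv, "ACC": acc,
            "SEN": sen, "SPE": spe, "B-ACC": (sen + spe) / 2}
```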