This is the source code of the baselines used in JOVA. The source code of the entire project can be found here. In the project, the works used as baselines are:
- SimBoost
- KronRLS
- IVPGAN
- ECFP-PSC, based on PADME-ECFP
- GraphConv-PSC, based on PADME-GraphConv
- CPI-Reg, based on CPI
The implementation details of these baselines are presented in the paper. One additional approach in the repository works for Drug-Target Interaction (DTI) prediction but was not included in the write-up.
In this documentation, we explain the processes needed to run each of the methods mentioned above. We assume that a Linux OS is in use.
| Library/Project | Version |
|---|---|
| Python | 3.7 |
| RDKit | 2019.09.3.0 |
| PyTorch | 1.3.0 |
| NumPy | 1.18.4 |
| XGBoost | 0.90 |
| Pandas | 1.0.3 |
| Seaborn | 0.9.0 |
| Soek | 0.0.1 |
| torch-scatter | 2.0.5 |
| BioPython | 1.76 |
| Scikit-Learn | 0.23.1 |
| tqdm | 4.35.0 |
To install the dependencies, we suggest you install Anaconda first and then follow the commands below:
- Create the anaconda environment:
$ conda create -n jova python=3.7
- Activate the environment:
$ conda activate jova
- Install the dependencies above according to their official websites or documentation. For instance, you can install XGBoost using the command:
$ pip install xgboost==0.90
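The remaining dependencies can be installed in a similar fashion. The sketch below is only a convenience summary that assumes the conda/PyPI package names match the table above; for RDKit, PyTorch, and torch-scatter in particular, prefer the installation instructions on their official websites:

```bash
# Convenience sketch only; package names/channels are assumptions. Follow the
# official instructions for RDKit, PyTorch, and torch-scatter if these fail.
conda install -c rdkit rdkit=2019.09.3    # RDKit from the rdkit channel
conda install -c pytorch pytorch=1.3.0    # PyTorch 1.3.0
pip install numpy==1.18.4 pandas==1.0.3 seaborn==0.9.0 tqdm==4.35.0
pip install xgboost==0.90 biopython==1.76 scikit-learn==0.23.1
pip install torch-scatter==2.0.5          # may need a wheel matching your torch/CUDA setup
pip install soek==0.0.1                   # if unavailable on PyPI, install from its source repository
```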
The Davis, Metz, KIBA, and Toxcast datasets are used for regression and were taken from PADME.
The EGFR_1M17, EGFR Case Study, and EGFR unfiltered datasets were constructed in our project for the case studies described in the paper. In EGFR_1M17, the 1M17 sequence of EGFR is used, whereas the EGFR UniProt sequence is used in EGFR Case Study. The compounds in EGFR_1M17 and EGFR Case Study are the DrugBank compounds that are not part of the KIBA dataset.
The Celegans and Human datasets are used for binary classification and were taken from the CPI project.
The protein directory contains the files that facilitate learning protein embeddings.
We constructed the protein_words_dict.pkl file following the Prot2Vec section described in the supplementary document, using all prot_info.csv files in data.
Afterwards, the profile of each protein in, for instance, ../../data/davis_data/prot_desc.csv and ../../data/egfr_1M17/prot_desc_pdb_1M17.csv is generated by executing:
$ python build_prot_vocabs.py --vocab ../../data/protein/protein_words_dict.pkl --prot_desc_path ../../data/davis_data/prot_desc.csv --prot_desc_path ../../data/egfr_1M17/prot_desc_pdb_1M17.csv
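As the command above shows, the script accepts multiple --prot_desc_path flags, so the profiles for several regression datasets could in principle be built in one call. The Metz and KIBA paths below are assumptions patterned after the Davis layout; adjust them to the actual file names in data:

```bash
# Assumed paths, mirroring ../../data/davis_data/prot_desc.csv; verify against the data directory.
$ python build_prot_vocabs.py --vocab ../../data/protein/protein_words_dict.pkl \
    --prot_desc_path ../../data/davis_data/prot_desc.csv \
    --prot_desc_path ../../data/metz_data/prot_desc.csv \
    --prot_desc_path ../../data/kiba_data/prot_desc.csv
```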
In the binary classification case, we construct the protein profile for each dataset separately. For instance, on the Human dataset, we construct the profile using:
$ python build_prot_vocabs_cpi.py --prot_desc_path ../../data/human_data/prot_desc.csv
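The Celegans dataset is handled the same way; the path below is an assumption patterned after the Human dataset:

```bash
# Assumed path, mirroring ../../data/human_data/prot_desc.csv
$ python build_prot_vocabs_cpi.py --prot_desc_path ../../data/celegans_data/prot_desc.csv
```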
First, ensure you are in the project directory and set it up with:
$ pip install -e .
Then cd into the dti directory of the project with:
$ cd proj/dti/
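Putting the setup steps together, a typical session might look like the sketch below; the jova environment name comes from the installation section, and the final ls is only there to list the experiment scripts described next:

```bash
$ conda activate jova   # environment created during installation
$ pip install -e .      # run from the project root
$ cd proj/dti/
$ ls *.py               # mf.py, simboost.py, kronrls.py, singleview.py, among others
```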
- `simboost.py`
To run the SimBoost implementation, you first need to run the Matrix Factorization (MF) experiment. The MF experiment produces two files after execution. The first is a .mod file, which stores the weights of the trained model, and the second has the suffix _mf_simboost_data_dict.pkl, which is a Python dictionary of compound-target features needed in the SimBoost feature construction stage. This can be achieved with:
$ python mf.py --dataset_name davis --dataset_file ../../data/davis_data/restructured.csv --prot_desc_path ../../data/davis_data/prot_desc.csv --comp_view ecfp8 --prot_view psc
Once the *_mf_simboost_data_dict.pkl file is created, you can run the SimBoost experiment using:
$ python simboost.py --dataset_name davis --dataset_file ../../data/davis_data/restructured.csv --prot_desc_path ../../data/davis_data/prot_desc.csv --model_dir ./model_dir/davis --filter_threshold 6 --comp_view ecfp8 --prot_view psc --fold_num 5 --mf_simboost_data_dict davis_MF_kiba_ecfp8_psc_mf_simboost_data_dict.pkl
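The same two-step procedure applies to the other regression datasets. A sketch for KIBA is shown below; the dataset paths and the generated dictionary name are assumptions, so check the actual *_mf_simboost_data_dict.pkl file name written by mf.py before running simboost.py:

```bash
# Step 1: matrix factorization (assumed KIBA paths, mirroring the Davis example)
$ python mf.py --dataset_name kiba --dataset_file ../../data/kiba_data/restructured.csv \
    --prot_desc_path ../../data/kiba_data/prot_desc.csv --comp_view ecfp8 --prot_view psc
# Step 2: SimBoost, pointing --mf_simboost_data_dict at the dictionary written in step 1
$ python simboost.py --dataset_name kiba --dataset_file ../../data/kiba_data/restructured.csv \
    --prot_desc_path ../../data/kiba_data/prot_desc.csv --model_dir ./model_dir/kiba \
    --comp_view ecfp8 --prot_view psc --fold_num 5 \
    --mf_simboost_data_dict <name_of_generated_mf_simboost_data_dict.pkl>
```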
- `kronrls.py`
A sample command to run the KronRLS experiment on the KIBA dataset is in kronrls_cv.sh.
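kronrls_cv.sh remains the authoritative reference for the exact arguments. Purely as an illustration, and assuming the command-line interface mirrors the other regression scripts in this directory, a run might look like:

```bash
# Illustration only; flags are assumed to mirror mf.py/simboost.py. See kronrls_cv.sh for the real command.
$ python kronrls.py --dataset_name kiba --dataset_file ../../data/kiba_data/restructured.csv \
    --prot_desc_path ../../data/kiba_data/prot_desc.csv --model_dir ./model_dir/kiba \
    --comp_view ecfp8 --prot_view psc --fold_num 5
```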
- `train_joint.py`
This script trains a model that integrates the ECFP and GraphConv features of a compound and the PSC of a protein using the MSE loss function. train_joint_cv.sh shows a sample run command.
- `train_joint_gan.py`
The IVPGAN baseline. Revised implementation of the initial version.
- `singleview.py`
This script trains models that use unimodal representations of compounds and targets. A sample run command on the Metz dataset can be found in singleview.sh. To train an ECFP-PSC model, set the `comp_view` and `prot_view` flags to `--comp_view ecfp8 --prot_view psc` in singleview.sh. The possible compound views are `[ecfp4, ecfp8, weave, gconv, gnn]` and the possible target (protein) views are `[psc, rnn, pcnn, p2v]`.
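For example, assuming singleview.py shares the dataset-related flags shown for the other scripts (the Metz paths below are assumptions; singleview.sh remains the reference), an ECFP-PSC run on Metz might look like:

```bash
# Illustration only; dataset paths and shared flags are assumptions. See singleview.sh.
$ python singleview.py --dataset_name metz --dataset_file ../../data/metz_data/restructured.csv \
    --prot_desc_path ../../data/metz_data/prot_desc.csv --model_dir ./model_dir/metz \
    --comp_view ecfp8 --prot_view psc --fold_num 5
```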
- `cpi_baseline.py`
Implements the CPI-Reg baseline; thus, use `--comp_view gnn --prot_view pcnna`. Sample run file: cpi_baseline.sh. The `pcnn`/`pcnna` and `gnn` views are due to Tsubaki et al.
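Again assuming the same dataset flags as the other regression scripts (cpi_baseline.sh is the reference), a run on Davis could look like:

```bash
# Illustration only; dataset flags are assumed to mirror the other scripts. See cpi_baseline.sh.
$ python cpi_baseline.py --dataset_name davis --dataset_file ../../data/davis_data/restructured.csv \
    --prot_desc_path ../../data/davis_data/prot_desc.csv --model_dir ./model_dir/davis \
    --comp_view gnn --prot_view pcnna --fold_num 5
```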
- `cpi_baseline_bin.py`
Follows the implementation of CPI_prediction. Sample run file: cpi_baseline_bin_cv_human.sh.
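As an illustration, and assuming cpi_baseline_bin.py takes the same kind of dataset arguments as the other scripts (cpi_baseline_bin_cv_human.sh is the authoritative sample), a run on the Human dataset might be:

```bash
# Illustration only; flags and file names are assumptions. See cpi_baseline_bin_cv_human.sh for the real command.
$ python cpi_baseline_bin.py --dataset_name human --dataset_file ../../data/human_data/restructured.csv \
    --prot_desc_path ../../data/human_data/prot_desc.csv --model_dir ./model_dir/human --fold_num 5
```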
To evaluate a model, you will need to specify the `--eval`, `--eval_model_name`, and `--model_dir` flags. Please see the command line arguments in each file for more on model evaluation.
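For instance, to evaluate a previously trained SimBoost model, the training command would be repeated with the evaluation flags added. The sketch below assumes `--eval` is a simple switch, and <saved_model_file> is a placeholder for whatever file the training run wrote to --model_dir:

```bash
# Evaluation sketch; <saved_model_file> is a placeholder for a file written to ./model_dir/davis during training.
$ python simboost.py --dataset_name davis --dataset_file ../../data/davis_data/restructured.csv \
    --prot_desc_path ../../data/davis_data/prot_desc.csv --comp_view ecfp8 --prot_view psc \
    --model_dir ./model_dir/davis --filter_threshold 6 --fold_num 5 \
    --mf_simboost_data_dict davis_MF_kiba_ecfp8_psc_mf_simboost_data_dict.pkl \
    --eval --eval_model_name <saved_model_file>
```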
- `worker_jova.py`
Used to preprocess experiment data for analysis.
We acknowledge the authors of the PADME project for their work. Our project uses the data, data loading, and metric procedures published by their work, and we are grateful.
We acknowledge the authors and contributors of the DeepChem project for their implementations of the Graph Convolution, Weave, and other featurization schemes; the GraphConv and Weave implementations in this work are essentially our PyTorch translations of their original implementations.
We also acknowledge the CPI_prediction project for the PCNNA and GNN implementations. We re-organized the GNN implementation into a DeepChem-compatible featurization scheme in this project.
Thanks to Yulkang for the work in numpytorch.py.
@article{Agyemang2020,
archivePrefix = {arXiv},
arxivId = {2005.00397},
author = {Agyemang, Brighter and Wu, Wei-Ping and Kpiebaareh, Michael Yelpengne
and Lei, Zhihua and Nanor, Ebenezer and Chen, Lei},
doi = {10.1016/j.jbi.2020.103547},
eprint = {2005.00397},
issn = {1532-0464},
journal = {Journal of Biomedical Informatics},
keywords = {Drug–target interactions, Machine learning},
number = {August},
pages = {103547},
publisher = {Elsevier Inc.},
title = {{Multi-View Self-Attention for Interpretable Drug-Target Interaction Prediction}},
url = {https://pubmed.ncbi.nlm.nih.gov/32860883/},
volume = {110},
year = {2020}
}