This is the source code of the baselines used in JOVA. The source code of the entire project can be found here. In the project, the works used as baselines are:
- SimBoost
- KronRLS
- IVPGAN
- ECFP-PSC, based on PADME-ECFP
- GraphConv-PSC, based on PADME-GraphConv
- CPI-Reg, based on CPI
The implementation details of these baselines are presented in the paper. One additional approach in the repository works for Drug-Target Interaction (DTI) prediction but was not included in the write-up.
In this documentation, we explain the processes needed to run each of the methods mentioned above. We assume that a Linux OS is in use.
| Library/Project | Version |
|---|---|
| Python | 3.7 |
| RDKit | 2019.09.3.0 |
| PyTorch | 1.3.0 |
| NumPy | 1.18.4 |
| XGBoost | 0.90 |
| Pandas | 1.0.3 |
| Seaborn | 0.9.0 |
| Soek | 0.0.1 |
| torch-scatter | 2.0.5 |
| BioPython | 1.76 |
| Scikit-Learn | 0.23.1 |
| tqdm | 4.35.0 |
To install the dependencies, we suggest you install Anaconda first and then follow the commands below:
- Create the anaconda environment:
$ conda create -n jova python=3.7
- Activate the environment:
$ conda activate jova
- Install the dependencies above according to their official websites or documentation. For instance, you can install XGBoost using the command:
$ pip install xgboost==0.90
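The remaining dependencies can be installed in a similar fashion. The sketch below is only a convenience summary that assumes the conda/PyPI package names match the table above; for RDKit, PyTorch, and torch-scatter in particular, prefer the installation instructions on their official websites:

```bash
# Convenience sketch only; package names/channels are assumptions. Follow the
# official instructions for RDKit, PyTorch, and torch-scatter if these fail.
conda install -c rdkit rdkit=2019.09.3    # RDKit from the rdkit channel
conda install -c pytorch pytorch=1.3.0    # PyTorch 1.3.0
pip install numpy==1.18.4 pandas==1.0.3 seaborn==0.9.0 tqdm==4.35.0
pip install xgboost==0.90 biopython==1.76 scikit-learn==0.23.1
pip install torch-scatter==2.0.5          # may need a wheel matching your torch/CUDA setup
pip install soek==0.0.1                   # if unavailable on PyPI, install from its source repository
```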
The Davis, Metz, KIBA, and Toxcast datasets are used for regression and were taken from PADME.
The EGFR_1M17, EGFR Case Study, and EGFR unfiltered datasets were constructed in our project for the case studies described in the paper. In EGFR_1M17, the 1M17 sequence of EGFR is used, whereas the EGFR UniProt sequence is used in EGFR Case Study. The compounds in EGFR_1M17 and EGFR Case Study are the DrugBank compounds that are not part of the KIBA dataset.
The Celegans and Human datasets are used for binary classification and were taken from the CPI project.
The protein directory contains the files that facilitate learning protein embeddings.
We constructed the protein_words_dict.pkl file following the Prot2Vec section described in the supplementary document, using all prot_info.csv files in data.
Afterwards, the profile of each protein in, for instance, ../../data/davis_data/prot_desc.csv and ../../data/egfr_1M17/prot_desc_pdb_1M17.csv is generated by executing:
$ python build_prot_vocabs.py --vocab ../../data/protein/protein_words_dict.pkl --prot_desc_path ../../data/davis_data/prot_desc.csv --prot_desc_path ../../data/egfr_1M17/prot_desc_pdb_1M17.csv
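As the command above shows, the script accepts multiple --prot_desc_path flags, so the profiles for several regression datasets could in principle be built in one call. The Metz and KIBA paths below are assumptions patterned after the Davis layout; adjust them to the actual file names in data:

```bash
# Assumed paths, mirroring ../../data/davis_data/prot_desc.csv; verify against the data directory.
$ python build_prot_vocabs.py --vocab ../../data/protein/protein_words_dict.pkl \
    --prot_desc_path ../../data/davis_data/prot_desc.csv \
    --prot_desc_path ../../data/metz_data/prot_desc.csv \
    --prot_desc_path ../../data/kiba_data/prot_desc.csv
```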
In the binary classification case, we construct the protein profile for each dataset separately. For instance, on the Human dataset, we construct the profile using:
$ python build_prot_vocabs_cpi.py --prot_desc_path ../../data/human_data/prot_desc.csv
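The Celegans dataset is handled the same way; the path below is an assumption patterned after the Human dataset:

```bash
# Assumed path, mirroring ../../data/human_data/prot_desc.csv
$ python build_prot_vocabs_cpi.py --prot_desc_path ../../data/celegans_data/prot_desc.csv
```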
First, ensure you are in the project directory and set it up with:
$ pip install -e .
Then cd into the dti directory of the project with:
$ cd proj/dti/
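Putting the setup steps together, a typical session might look like the sketch below; the jova environment name comes from the installation section, and the final ls is only there to list the experiment scripts described next:

```bash
$ conda activate jova   # environment created during installation
$ pip install -e .      # run from the project root
$ cd proj/dti/
$ ls *.py               # mf.py, simboost.py, kronrls.py, singleview.py, among others
```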
- `simboost.py`
To run the SimBoost implementation, you first need to run the Matrix Factorization (MF) experiment. The MF experiment produces two files after execution. The first is a .mod file, which stores the weights of the trained model, and the second has the suffix _mf_simboost_data_dict.pkl, which is a Python dictionary of compound-target features needed in the SimBoost feature construction stage. This can be achieved with:
$ python mf.py --dataset_name davis --dataset_file ../../data/davis_data/restructured.csv --prot_desc_path ../../data/davis_data/prot_desc.csv --comp_view ecfp8 --prot_view psc
Once the *_mf_simboost_data_dict.pkl file is created, you can run the SimBoost experiment using:
$ python simboost.py --dataset_name davis --dataset_file ../../data/davis_data/restructured.csv --prot_desc_path ../../data/davis_data/prot_desc.csv --model_dir ./model_dir/davis --filter_threshold 6 --comp_view ecfp8 --prot_view psc --fold_num 5 --mf_simboost_data_dict davis_MF_kiba_ecfp8_psc_mf_simboost_data_dict.pkl
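The same two-step procedure applies to the other regression datasets. A sketch for KIBA is shown below; the dataset paths and the generated dictionary name are assumptions, so check the actual *_mf_simboost_data_dict.pkl file name written by mf.py before running simboost.py:

```bash
# Step 1: matrix factorization (assumed KIBA paths, mirroring the Davis example)
$ python mf.py --dataset_name kiba --dataset_file ../../data/kiba_data/restructured.csv \
    --prot_desc_path ../../data/kiba_data/prot_desc.csv --comp_view ecfp8 --prot_view psc
# Step 2: SimBoost, pointing --mf_simboost_data_dict at the dictionary written in step 1
$ python simboost.py --dataset_name kiba --dataset_file ../../data/kiba_data/restructured.csv \
    --prot_desc_path ../../data/kiba_data/prot_desc.csv --model_dir ./model_dir/kiba \
    --comp_view ecfp8 --prot_view psc --fold_num 5 \
    --mf_simboost_data_dict <name_of_generated_mf_simboost_data_dict.pkl>
```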
- `kronrls.py`
A sample command to run the KronRLS experiment on the KIBA dataset is in kronrls_cv.sh.
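kronrls_cv.sh remains the authoritative reference for the exact arguments. Purely as an illustration, and assuming the command-line interface mirrors the other regression scripts in this directory, a run might look like:

```bash
# Illustration only; flags are assumed to mirror mf.py/simboost.py. See kronrls_cv.sh for the real command.
$ python kronrls.py --dataset_name kiba --dataset_file ../../data/kiba_data/restructured.csv \
    --prot_desc_path ../../data/kiba_data/prot_desc.csv --model_dir ./model_dir/kiba \
    --comp_view ecfp8 --prot_view psc --fold_num 5
```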
- `train_joint.py`
This script trains a model that integrates the ECFP and GraphConv features of a compound and the PSC of a protein using the MSE loss function. train_joint_cv.sh shows a sample run command.
- `train_joint_gan.py`
The IVPGAN baseline. Revised implementation of the initial version.
- `singleview.py`
This script trains models that use unimodal representations of compounds and targets. A sample run command on the Metz dataset can be found in singleview.sh. To train an ECFP-PSC model, set the `comp_view` and `prot_view` flags to `--comp_view ecfp8 --prot_view psc` in singleview.sh. The possible compound views are `[ecfp4, ecfp8, weave, gconv, gnn]` and the possible target (protein) views are `[psc, rnn, pcnn, p2v]`.
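For example, assuming singleview.py shares the dataset-related flags shown for the other scripts (the Metz paths below are assumptions; singleview.sh remains the reference), an ECFP-PSC run on Metz might look like:

```bash
# Illustration only; dataset paths and shared flags are assumptions. See singleview.sh.
$ python singleview.py --dataset_name metz --dataset_file ../../data/metz_data/restructured.csv \
    --prot_desc_path ../../data/metz_data/prot_desc.csv --model_dir ./model_dir/metz \
    --comp_view ecfp8 --prot_view psc --fold_num 5
```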
- `cpi_baseline.py`
Implements the CPI-Reg baseline; thus, use `--comp_view gnn --prot_view pcnna`. Sample run file: cpi_baseline.sh. The `pcnn`/`pcnna` and `gnn` views are due to Tsubaki et al.
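Again assuming the same dataset flags as the other regression scripts (cpi_baseline.sh is the reference), a run on Davis could look like:

```bash
# Illustration only; dataset flags are assumed to mirror the other scripts. See cpi_baseline.sh.
$ python cpi_baseline.py --dataset_name davis --dataset_file ../../data/davis_data/restructured.csv \
    --prot_desc_path ../../data/davis_data/prot_desc.csv --model_dir ./model_dir/davis \
    --comp_view gnn --prot_view pcnna --fold_num 5
```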
- `cpi_baseline_bin.py`
Follows the implementation of CPI_prediction. Sample run file: cpi_baseline_bin_cv_human.sh.
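As an illustration, and assuming cpi_baseline_bin.py takes the same kind of dataset arguments as the other scripts (cpi_baseline_bin_cv_human.sh is the authoritative sample), a run on the Human dataset might be:

```bash
# Illustration only; flags and file names are assumptions. See cpi_baseline_bin_cv_human.sh for the real command.
$ python cpi_baseline_bin.py --dataset_name human --dataset_file ../../data/human_data/restructured.csv \
    --prot_desc_path ../../data/human_data/prot_desc.csv --model_dir ./model_dir/human --fold_num 5
```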
To evaluate a model, you will need to specify the `--eval`, `--eval_model_name`, and `--model_dir` flags. Please see the command line arguments in each file for more on model evaluation.
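For instance, to evaluate a previously trained SimBoost model, the training command would be repeated with the evaluation flags added. The sketch below assumes `--eval` is a simple switch, and <saved_model_file> is a placeholder for whatever file the training run wrote to --model_dir:

```bash
# Evaluation sketch; <saved_model_file> is a placeholder for a file written to ./model_dir/davis during training.
$ python simboost.py --dataset_name davis --dataset_file ../../data/davis_data/restructured.csv \
    --prot_desc_path ../../data/davis_data/prot_desc.csv --comp_view ecfp8 --prot_view psc \
    --model_dir ./model_dir/davis --filter_threshold 6 --fold_num 5 \
    --mf_simboost_data_dict davis_MF_kiba_ecfp8_psc_mf_simboost_data_dict.pkl \
    --eval --eval_model_name <saved_model_file>
```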
- `worker_jova.py`
Used to preprocess experiment data for analysis.
We acknowledge the authors of the PADME project for their work. Our project uses the data, data loading, and metric procedures published by their work, and we are grateful.
We acknowledge the authors and contributors of the DeepChem project for their implementations of the Graph Convolution, Weave, and other featurization schemes; the GraphConv and Weave implementations in this work are essentially our PyTorch translations of their original implementations.
We also acknowledge the CPI_prediction project for the PCNNA and GNN implementations. We re-organized the GNN implementation into a DeepChem-compatible featurization scheme in this project.
Thanks to Yulkang for the work in numpytorch.py.
@article{Agyemang2020,
archivePrefix = {arXiv},
arxivId = {2005.00397},
author = {Agyemang, Brighter and Wu, Wei-Ping and Kpiebaareh, Michael Yelpengne
and Lei, Zhihua and Nanor, Ebenezer and Chen, Lei},
doi = {10.1016/j.jbi.2020.103547},
eprint = {2005.00397},
issn = {1532-0464},
journal = {Journal of Biomedical Informatics},
keywords = {Drug–target interactions, Machine learning},
number = {August},
pages = {103547},
publisher = {Elsevier Inc.},
title = {{Multi-View Self-Attention for Interpretable Drug-Target Interaction Prediction}},
url = {https://pubmed.ncbi.nlm.nih.gov/32860883/},
volume = {110},
year = {2020}
}