KG4SL

Synthetic lethality (SL) is a promising gold mine for the discovery of anti-cancer drug targets. KG4SL is a novel graph neural network (GNN)-based model that incorporates knowledge graph message-passing into SL prediction. The knowledge graph was constructed using 11 kinds of entities, including genes, compounds, diseases, and biological processes, and 24 kinds of relationships that could be pertinent to SL. Integrating the knowledge graph helps address the independence issue (SL pairs being treated as independent of one another) and avoids manual feature engineering, because gene representations are learned by message-passing on the knowledge graph.
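
To make the message-passing idea concrete, here is a minimal sketch of one round of relation-aware neighbor aggregation. It is not the KG4SL implementation: the weighting scheme is an assumption borrowed from the general style of KGCN, and the toy graph, embedding values, and the function names `aggregate` and `predict` are purely illustrative.

```python
import math

# Toy knowledge graph: head -> list of (relation, tail). All names are made up.
kg = {
    "GENE_A": [("participates_in", "PROCESS_1"), ("interacts_with", "GENE_C")],
}

# Toy 2-dimensional embeddings (hypothetical values; normally learned).
entity_emb = {
    "GENE_A": [1.0, 0.0],
    "GENE_B": [0.0, 1.0],
    "GENE_C": [0.5, 0.5],
    "PROCESS_1": [0.0, 2.0],
}
relation_emb = {
    "participates_in": [0.0, 1.0],
    "interacts_with": [1.0, 0.0],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(gene, partner):
    """One message-passing step for `gene`, conditioned on its candidate partner.

    Each neighbor is weighted by the relevance of its relation to the partner
    gene (dot product followed by softmax); the weighted neighbor sum is then
    added to the gene's own embedding.
    """
    neighbors = kg.get(gene, [])
    if not neighbors:
        return list(entity_emb[gene])
    scores = [dot(entity_emb[partner], relation_emb[r]) for r, _ in neighbors]
    weights = softmax(scores)
    dim = len(entity_emb[gene])
    message = [
        sum(w * entity_emb[t][i] for w, (_, t) in zip(weights, neighbors))
        for i in range(dim)
    ]
    return [s + m for s, m in zip(entity_emb[gene], message)]

def predict(gene_a, gene_b):
    """Score a gene pair as the sigmoid of the inner product of both representations."""
    za = aggregate(gene_a, gene_b)
    zb = aggregate(gene_b, gene_a)
    return 1.0 / (1.0 + math.exp(-dot(za, zb)))

score = predict("GENE_A", "GENE_B")
```

The point of the sketch is that no hand-crafted gene features appear anywhere: the representation of `GENE_A` is built from whatever the graph says about it, which is the sense in which message-passing replaces manual feature engineering.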

Dataset collection

The data used to train and test KG4SL were downloaded from SynLethDB (http://synlethdb.sist.shanghaitech.edu.cn/v2/#/), a comprehensive database of synthetic lethal gene pairs. The version we used includes 36,402 human SL pairs, as well as a knowledge graph (KG) with 11 kinds of entities and 24 kinds of relationships. The KG, which stores information pertinent to SL, was constructed based on SynLethDB and Hetionet. For details of the data, please refer to our paper 'KG4SL: Knowledge Graph Neural Network for Synthetic Lethality Prediction in Human Cancers'. The images below summarize the SL pairs and the knowledge graph.

[Images: summaries of the SL pairs and the knowledge graph]

Because SynLethDB is continuously updated, if you wish to use the latest version of the data, please download it directly from the SynLethDB website.

Files in the folder:

> data
    1. input datasets for preprocessing.py
        kg_triplet: head, relation, tail
        relation2id: relation_name, relation_id
        sl_data: gene_a, gene_b
    2. input datasets for main.py (The datasets below are generated through preprocessing.py.)
        sl2id: gene_a_id, gene_b_id, 0/1
        kg2id: head_id, relation_id, tail_id
        entity2id: origin_id, new_id
    3. input dataset for train.py
        dbid2name: _id, name

> results
    eval_data_final_1_X: dataset used for validation
    test_data_final_1: dataset used for test
    test_data_mapping_final_1: dataset used for test with gene names
    train_data_final_1_X: dataset used for train
    loss_curve_final_1_X: values of the loss and three metrics recorded at each epoch
    training_curve_final_1_X: training curve (It is recommended that you turn off the early stop mechanism while getting the training curve.)
    
    Note: 
    The first number in a file name identifies the partition used to hold out the test data, which is done only once in the paper.
    The second number identifies the partition into train data and validation data, which is repeated five times in the paper.
    The third number, X, indicates that within n_epoch epochs the optimal result was obtained for the X-th time.
> src
    Implementations of KG4SL.
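
The naming note above implies a two-level split: the test set is held out once, and the remaining pairs are divided into train and validation sets five times. A minimal sketch of such a scheme follows, using only the standard library. The split ratios, the seed, and the function name `split_pairs` are assumptions for illustration, not values taken from this repository.

```python
import random

def split_pairs(pairs, test_ratio=0.1, val_ratio=0.1, n_repeats=5, seed=0):
    """Hold out a test set once, then draw n_repeats train/validation splits."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)

    # Test data is partitioned a single time and never reused for training.
    n_test = int(len(shuffled) * test_ratio)
    test = shuffled[:n_test]
    rest = shuffled[n_test:]

    # The remaining pairs are re-partitioned n_repeats times.
    splits = []
    for _ in range(n_repeats):
        reshuffled = list(rest)
        rng.shuffle(reshuffled)
        n_val = int(len(reshuffled) * val_ratio)
        splits.append((reshuffled[n_val:], reshuffled[:n_val]))  # (train, val)
    return test, splits

# Toy labelled pairs shaped like sl2id rows: (gene_a_id, gene_b_id, label).
pairs = [(i, i + 100, i % 2) for i in range(50)]
test, splits = split_pairs(pairs)
```

Because the test set is fixed before any repetition, every one of the five runs is evaluated on the same held-out pairs, which keeps their metrics comparable.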

Running the code:

cd src
python preprocessing.py
python main.py
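
Going by the file descriptions above, the preprocessing step turns raw identifiers into integer ids that main.py can consume. The following is a rough sketch of that kind of re-indexing, not the actual contents of preprocessing.py. The sample triples, the indexing order, and the helper name `build_id_map` are hypothetical.

```python
def build_id_map(values):
    """Assign consecutive integer ids in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping

# Hypothetical raw inputs shaped like kg_triplet (head, relation, tail),
# relation2id (relation_name -> relation_id) and sl_data (gene_a, gene_b).
kg_triplet = [
    ("G1", "interacts", "G2"),
    ("G2", "participates", "P7"),
]
relation2id = {"interacts": 0, "participates": 1}
sl_data = [("G1", "G2")]

# Assumption: genes from the SL pairs are indexed first, then other KG entities.
entities = [g for pair in sl_data for g in pair]
entities += [e for h, _, t in kg_triplet for e in (h, t)]
entity2id = build_id_map(entities)  # origin_id -> new_id

# Rewrite both datasets in terms of the new ids.
kg2id = [(entity2id[h], relation2id[r], entity2id[t]) for h, r, t in kg_triplet]
# Label 1 marks a known SL pair; rows with label 0 are not generated in this sketch.
sl2id = [(entity2id[a], entity2id[b], 1) for a, b in sl_data]
```

Sharing one `entity2id` map between the SL pairs and the knowledge graph is what lets a gene in an SL pair be looked up as a node in the graph.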

Requirements:

python 3.6
tensorflow-gpu 1.15.0
torch 1.1.0
dgl 0.5.2
pandas 1.1.5
scikit-learn 0.24.0
matplotlib 3.3.3

Note: You can install all required packages through the command 'pip install -r pip_install.txt'.

Supplementary Files:

Results of some additional experiments can be found in the Supplementary Materials.

Acknowledgments:

The code was inspired by KGNN-LS and KGCN.

Knowledge-aware Graph Neural Networks with Label Smoothness Regularization for Recommender Systems
Wang, Hongwei, et al. Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 2019.

Knowledge Graph Convolutional Networks for Recommender Systems
Wang, Hongwei, et al. arXiv preprint arXiv:1904.12575 (2019).

SynLethKG was constructed based on SynLethDB 1.0, SynLethDB 2.0 and Hetionet.

SynLethDB: synthetic lethality database toward discovery of selective and sensitive anticancer drug targets
Guo, Jing, Hui Liu, and Jie Zheng. Nucleic Acids Research, Vol. 44, Issue D1 (2016): D1011-D1017.

Systematic integration of biomedical knowledge prioritizes drugs for repurposing
Himmelstein, Daniel Scott, et al. eLife 6 (2017): e26726.

How to cite KG4SL:

@article{wang2021kg4sl,
  title={KG4SL: knowledge graph neural network for synthetic lethality prediction in human cancers},
  author={Wang, Shike and Xu, Fan and Li, Yunyang and Wang, Jie and Zhang, Ke and Liu, Yong and Wu, Min and Zheng, Jie},
  journal={Bioinformatics},
  volume={37},
  number={Supplement\_1},
  pages={i418--i425},
  year={2021},
  publisher={Oxford University Press}
}