DGCL: a contrastive learning method for predicting cancer driver genes based on graph diffusion





DGCL predicts cancer driver genes using contrastive learning under graph diffusion. First, it builds a gene-gene network from known protein-protein interactions (PPI). Personalized PageRank is then used for graph diffusion on this gene-gene network to obtain a diffusion-based gene-gene network. Next, contrastive learning is performed on these two networks: edge dropping and feature masking are applied to augment each network, and the augmented networks are fed into a Chebyshev encoder with shared parameters for feature learning, with a neighborhood contrastive learning loss as a constraint. The learned features are then passed through a network-specific encoder to learn network-specific embeddings, with node classification and link prediction used jointly as constraints. Finally, the feature representations learned from the two networks are concatenated, and the fused features are used in a logistic regression model to predict cancer driver genes.
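The graph diffusion step can be illustrated with a small sketch. This is not the DGCL implementation: it computes a dense personalized PageRank matrix by fixed-point iteration on a toy three-node path graph, in pure Python. The function name `personalized_pagerank`, the teleport probability `alpha=0.15`, and the iteration count are assumptions chosen for illustration.

```python
def personalized_pagerank(adj, alpha=0.15, num_iters=100):
    """Return a dense PPR diffusion matrix S for a small undirected graph.

    adj is a symmetric 0/1 adjacency matrix given as a list of lists.
    alpha is the teleport (restart) probability; 0.15 is an assumed value.
    Column j of S is the PPR distribution for a walk restarting at node j.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    # Column-normalised transition matrix T = A D^-1.
    trans = [
        [adj[i][j] / degree[j] if degree[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    s = [row[:] for row in identity]
    # Fixed-point iteration S <- alpha * I + (1 - alpha) * T S,
    # which converges to alpha * (I - (1 - alpha) * T)^-1.
    for _ in range(num_iters):
        s = [
            [
                alpha * identity[i][j]
                + (1.0 - alpha) * sum(trans[i][k] * s[k][j] for k in range(n))
                for j in range(n)
            ]
            for i in range(n)
        ]
    return s


# Toy path graph 0 - 1 - 2 (nodes 0 and 2 are not directly connected).
toy_adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
diffused = personalized_pagerank(toy_adj)
```

In this toy example `diffused[2][0]` is positive even though nodes 0 and 2 share no edge, which is the sense in which diffusion yields a second, denser view of the network. On a network with thousands of genes, a sparse or approximate computation would be needed instead of this dense loop.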


Requirements

  • Python 3.7
  • PyTorch 1.9.1+cu111
  • PyTorch Geometric 2.0.4
  • torch-scatter 2.0.8
  • torch-sparse 0.6.11
  • torch-cluster 1.5.9
  • torch-spline-conv 1.2.1
  • PyYAML 6.0


| File Name | Format | Size | Description |
| --- | --- | --- | --- |
| CPDB_data.pkl | -- | -- | The PPI network, gene features, gene names, and gene labels of the CPDB dataset. |
| ppi.pkl | torch.sparse_coo | 13627 × 13627 | Adjacency matrix (sparse) of the PPI network. |
| ppi_selfloop.pkl | torch.sparse_coo | 13627 × 13627 | Adjacency matrix (sparse) of the PPI network with self-loops. |
| k_sets.pkl | dict | -- | The data partitions used for ten-fold cross-validation. |
| Str_feature.pkl | tensor | 13627 × 16 | Structural features obtained by running the node2vec algorithm on the PPI network. |
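To check what one of these files holds before training, a short inspection helper may be useful. The sketch below assumes the files were written with Python's standard `pickle` module; if they were saved with `torch.save`, `torch.load` would be needed instead. The helper `describe_pickle` and the stand-in file `toy_k_sets.pkl` are hypothetical and exist only so the example runs without the real data.

```python
import pickle
import tempfile
from pathlib import Path


def describe_pickle(path):
    """Load a pickle file and return (type name, list of dict keys or None)."""
    with open(path, "rb") as handle:
        obj = pickle.load(handle)
    keys = list(obj.keys()) if isinstance(obj, dict) else None
    return type(obj).__name__, keys


# Stand-in for a dict-valued file such as k_sets.pkl; the contents are
# placeholders and do not reflect the real fold structure.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy_k_sets.pkl"
    with open(toy_path, "wb") as handle:
        pickle.dump({0: "fold-0 placeholder", 1: "fold-1 placeholder"}, handle)
    summary = describe_pickle(toy_path)
```

Unpickling files that contain tensors requires PyTorch to be importable, and pickle files should only be loaded from sources you trust.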

Running DGCL

First, set the model hyperparameters in the configuration file config.yaml:

  • drop_edge_rate_1: edge dropping probability for the protein-protein interaction network.

  • drop_edge_rate_2: edge dropping probability for the graph diffusion network.

  • drop_feature_rate_1: feature masking probability for the protein-protein interaction network.

  • drop_feature_rate_2: feature masking probability for the graph diffusion network.

  • tau: temperature hyperparameter of the neighborhood contrastive learning loss, in the range [0, 1].
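The effect of these hyperparameters can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the code in train_DGCL.py: `drop_edges` removes each edge independently, `mask_features` zeroes whole feature columns (an assumption about how masking is applied), and `info_nce` is a plain single-anchor InfoNCE loss standing in for the neighborhood contrastive loss. All three function names are hypothetical.

```python
import math
import random


def drop_edges(edges, drop_rate, rng):
    """Keep each edge independently with probability 1 - drop_rate."""
    return [edge for edge in edges if rng.random() >= drop_rate]


def mask_features(features, mask_rate, rng):
    """Zero out whole feature columns, each with probability mask_rate."""
    num_cols = len(features[0])
    keep = [rng.random() >= mask_rate for _ in range(num_cols)]
    return [[v if k else 0.0 for v, k in zip(row, keep)] for row in features]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, tau):
    """Temperature-scaled InfoNCE loss for one anchor; tau must be > 0."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, other) / tau) for other in negatives)
    return -math.log(pos / (pos + neg))


rng = random.Random(0)
toy_edges = [(0, 1), (1, 2), (2, 3)]
toy_features = [[1.0, 2.0], [3.0, 4.0]]
view_edges = drop_edges(toy_edges, drop_rate=0.3, rng=rng)
view_features = mask_features(toy_features, mask_rate=0.3, rng=rng)
loss_sharp = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], tau=0.1)
loss_soft = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], tau=1.0)
```

A smaller `tau` sharpens the softmax over similarities, so a well-aligned positive pair is rewarded more strongly (`loss_sharp` is lower than `loss_soft` here). A value of exactly 0 would cause a division by zero in this sketch.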

Then run:

    python train_DGCL.py --dataset=CPDB --cancer_type=pan-cancer

The default for --dataset is CPDB, and the default for --cancer_type is pan-cancer.

To train a model for a single cancer type, change --cancer_type, for example:

    python train_DGCL.py --cancer_type=brca
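For readers who want to see how the two flags and their defaults fit together, here is a minimal argparse sketch. It mirrors only the flag names and default values stated above; the actual argument handling in train_DGCL.py may differ, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser():
    """Build a parser with the two flags described in this README."""
    parser = argparse.ArgumentParser(
        description="Hypothetical sketch of the train_DGCL.py command line"
    )
    parser.add_argument("--dataset", default="CPDB",
                        help="PPI dataset name (default: CPDB)")
    parser.add_argument("--cancer_type", default="pan-cancer",
                        help="cancer type to train on (default: pan-cancer)")
    return parser


# No flags: both defaults apply.
default_args = build_parser().parse_args([])
# Single-cancer run: only --cancer_type is overridden.
brca_args = build_parser().parse_args(["--cancer_type=brca"])
```

Both the `--flag=value` and `--flag value` forms are accepted by argparse, so either spelling works with a parser built this way.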