Yue Liu1,2, Ke Liang1, Jun Xia2, Sihang Zhou1, Xihong Yang1, Xinwang Liu1, Stan Z. Li2
1National University of Defense Technology, 2Westlake University
Deep graph clustering, which aims to group the nodes of a graph into disjoint clusters with deep neural networks, has achieved promising progress in recent years. However, existing methods fail to scale to large graphs with millions of nodes. To solve this problem, we propose a scalable deep graph clustering method (Dink-Net) built on the idea of dilation and shrink. First, representations are learned in a self-supervised manner by discriminating whether nodes have been corrupted by augmentations. Meanwhile, the cluster centers are initialized as learnable neural parameters. Subsequently, the clustering distribution is optimized by minimizing the proposed cluster dilation loss and cluster shrink loss in an adversarial manner. With these settings, we unify the two-step clustering pipeline, i.e., representation learning and clustering optimization, into an end-to-end framework that guides the network to learn clustering-friendly features. In addition, Dink-Net scales well to large graphs because the designed loss functions optimize the clustering distribution on mini-batch data without performance drops. Both experimental results and theoretical analyses demonstrate the superiority of our method.
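The two clustering losses named above can be illustrated in a few lines. The sketch below is an assumption, not the repository's implementation: it takes the dilation loss to be the negative mean squared distance between distinct cluster centers, and the shrink loss to be the mean squared distance between mini-batch embeddings and all centers. The function names and the `weight` parameter are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dilation_loss(centers: List[Vector]) -> float:
    """Assumed form: negative mean squared distance over ordered pairs
    of distinct centers. Minimizing it pushes centers apart.
    Requires at least two centers."""
    k = len(centers)
    total = sum(
        squared_distance(centers[i], centers[j])
        for i in range(k)
        for j in range(k)
        if i != j
    )
    return -total / (k * (k - 1))


def shrink_loss(batch: List[Vector], centers: List[Vector]) -> float:
    """Assumed form: mean squared distance between every embedding in the
    mini-batch and every center. Minimizing it pulls samples and centers together."""
    total = sum(squared_distance(h, c) for h in batch for c in centers)
    return total / (len(batch) * len(centers))


def clustering_loss(batch: List[Vector], centers: List[Vector], weight: float = 1.0) -> float:
    """Hypothetical combination of the two terms; `weight` is illustrative only."""
    return dilation_loss(centers) + weight * shrink_loss(batch, centers)
```

Because both terms only need the current mini-batch and the (small) set of centers, the cost per step in this sketch does not depend on the total number of nodes in the graph.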
Dataset | Type | # Nodes | # Edges | # Feature Dimensions | # Classes |
---|---|---|---|---|---|
Cora | Attribute Graph | 2,708 | 5,278 | 1,433 | 7 |
CiteSeer | Attribute Graph | 3,327 | 4,614 | 3,703 | 6 |
Amazon-Photo | Attribute Graph | 7,650 | 119,081 | 745 | 8 |
ogbn-arxiv | Attribute Graph | 169,343 | 1,166,243 | 128 | 40 |
Reddit | Attribute Graph | 232,965 | 23,213,838 | 602 | 41 |
ogbn-products | Attribute Graph | 2,449,029 | 61,859,140 | 100 | 47 |
ogbn-papers100M | Attribute Graph | 111,059,956 | 1,615,685,872 | 128 | 172 |
The code is tested on Python 3.7.
dgl-cu113==0.9.1.post1
munkres==1.1.4
networkx==2.8.3
numpy==1.23.2
scikit_learn==1.3.0
scipy==1.6.0
torch==2.0.1
torch-scatter==2.0.9
torch-sparse==0.6.12
torch-spline-conv==1.2.1
torch-geometric==2.1.0.post1
tqdm==4.65.0
wandb==0.15.4
ogb==1.3.6
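To check an environment against pins like the ones listed above, a small helper along these lines may be useful. It is a sketch only: the function names are hypothetical, it handles nothing beyond the plain `name==version` form, and the `installed_version` lookup relies on `importlib.metadata`, which requires Python 3.8 or newer.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def parse_pins(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'name==version' lines into {name: version}, skipping blanks and comments."""
    pins: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, version = line.partition("==")
        if not sep:
            # A single '=' or a missing version is not a valid pin.
            raise ValueError(f"expected 'name==version', got: {line!r}")
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]],
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {name: (pinned, found)} where the found version differs from the pin.
    `found` is None when the package is not installed."""
    mismatches: Dict[str, Tuple[str, Optional[str]]] = {}
    for name, pinned in pins.items():
        found = lookup(name)
        if found != pinned:
            mismatches[name] = (pinned, found)
    return mismatches


def installed_version(name: str) -> Optional[str]:
    """Look up the installed version of a distribution (Python 3.8+ only)."""
    from importlib import metadata

    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None
```

A typical call would be `find_mismatches(parse_pins(open("requirements.txt")), installed_version)`, assuming the pins are saved in a file with that (hypothetical) name.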
--device | running device
--dataset | dataset name
--hid_units | hidden units
--activate | activation function
--tradeoff | tradeoff parameter
--lr | learning rate
--epochs | training epochs
--eval_inter | evaluation interval
--wandb | wandb logging
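The options above map naturally onto `argparse`. The following sketch shows one way such a parser could be declared; it is not the repository's `main.py`, and the types and default values are placeholders chosen for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options.
    All defaults are illustrative placeholders, not the project's real defaults."""
    parser = argparse.ArgumentParser(description="Dink-Net style options (illustrative)")
    parser.add_argument("--device", type=str, default="cpu", help="running device")
    parser.add_argument("--dataset", type=str, default="cora", help="dataset name")
    parser.add_argument("--hid_units", type=int, default=512, help="hidden units")
    parser.add_argument("--activate", type=str, default="prelu", help="activation function")
    parser.add_argument("--tradeoff", type=float, default=1.0, help="tradeoff parameter")
    parser.add_argument("--lr", type=float, default=1e-2, help="learning rate")
    parser.add_argument("--epochs", type=int, default=200, help="training epochs")
    parser.add_argument("--eval_inter", type=int, default=10, help="evaluation interval")
    # A boolean switch: present on the command line means True.
    parser.add_argument("--wandb", action="store_true", help="wandb logging")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Declaring `--wandb` with `action="store_true"` matches how the flag is used on the command line: it takes no value and is simply omitted to turn logging off.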
clone this repository and change directory to Dink-Net
git clone https://github.com/yueliu1999/Dink-Net.git
cd ./Dink-Net
unzip the datasets and model parameters
unzip -d ./data/ ./data/datasets.zip
unzip -d ./models/ ./models/models.zip
run the code with scripts
bash ./scripts/train_cora.sh
bash ./scripts/train_citeseer.sh
bash ./scripts/train_amazon_photo.sh
bash ./scripts/train_ogbn-arxiv.sh
or run the code directly with commands
python main.py --device cuda:0 --dataset cora --hid_units 512 --lr 1e-2 --epochs 200 --wandb
python main.py --device cuda:0 --dataset citeseer --hid_units 1536 --lr 5e-4 --epochs 200 --wandb
python main.py --device cuda:0 --dataset amazon_photo --hid_units 512 --lr 1e-2 --epochs 100 --eval_inter 1 --wandb
python main.py --device cuda:0 --dataset ogbn_arxiv --hid_units 1500 --encoder_layer 3 --lr 1e-4 --epochs 30 --batch_size 8192 --batch_train --eval_inter 1 --wandb
Tip: remove "--wandb" to disable wandb logging if a logging error occurs.
Table 1. Clustering performance (%) of our method and fourteen state-of-the-art baselines. Bold and underlined values are the best and runner-up results, respectively. “OOM” indicates that the method ran out of memory. “-” denotes that the method did not converge.
Figure 1. t-SNE visualization of seven methods on the Cora dataset.
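Clustering results are usually scored after matching predicted cluster ids to ground-truth classes, since cluster ids are arbitrary. The caption of Table 1 does not name the metrics, so the sketch below assumes clustering accuracy (ACC) is among them. The function name is hypothetical, and the brute-force search over mappings is only practical for a handful of clusters; for larger label sets a Hungarian-algorithm solver (for example the `munkres` package or `scipy.optimize.linear_sum_assignment`) would normally be used instead.

```python
from itertools import permutations
from typing import List, Optional, Sequence


def clustering_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Best accuracy over all one-to-one mappings from predicted cluster ids
    to class labels. Brute force: cost grows factorially with the number of clusters."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")

    clusters = sorted(set(y_pred))
    labels: List[Optional[int]] = sorted(set(y_true))
    # If there are more clusters than classes, pad with None so that the
    # surplus clusters map to "no class" and never count as correct.
    labels += [None] * max(0, len(clusters) - len(labels))

    best = 0
    for assignment in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        correct = sum(mapping[p] == t for t, p in zip(y_true, y_pred))
        best = max(best, correct)
    return best / len(y_true)
```

For example, predictions `[1, 1, 0, 0, 2, 2]` against labels `[0, 0, 1, 1, 2, 2]` score 1.0, because swapping cluster ids 0 and 1 makes every assignment correct.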
Our code is partly based on the following GitHub repositories. Thanks for their awesome work.
- Awesome Deep Graph Clustering: a collection of deep graph clustering papers, code, and datasets.
- Graph-Group-Discrimination: the official implementation of the Graph Group Discrimination (GGD) model.
- S3GC: the official implementation of the Scalable Self-Supervised Graph Clustering (S3GC) model.
- HSAN: the official implementation of the Hard Sample Aware Network (HSAN) model.
- SCGC: the official implementation of the Simple Contrastive Graph Clustering (SCGC) model.
- DCRN: the official implementation of the Dual Correlation Reduction Network (DCRN) model.
pretrain Dink-Net on your own dataset. Refer to here.
If you find this repository helpful, please cite our paper.
@inproceedings{Dink-Net,
title={Dink-Net: Neural Clustering on Large Graphs},
author={Liu, Yue and Liang, Ke and Xia, Jun and Zhou, Sihang and Yang, Xihong and Liu, Xinwang and Li, Stan Z.},
booktitle={International Conference on Machine Learning},
year={2023},
organization={PMLR}
}