
pygkernels — Kernels on Graphs for Python

Framework for clustering graph nodes using various similarity/dissimilarity measures.

List of measures:

  • Adjacency matrix based kernels:
    • Katz: Katz kernel (a.k.a. Walk, Von Neumann diffusion kernel)
    • Comm: Communicability kernel (a.k.a. Exponential diffusion kernel)
    • DFS: Double Factorial similarity
  • Laplacian based kernels:
    • For: Forest kernel (a.k.a. Regularized Laplacian kernel)
    • Heat: Heat kernel (a.k.a. Laplacian exponential diffusion kernel)
    • NHeat: Normalized Heat kernel
    • Abs: Absorption kernel
  • Markov matrix based kernels and measures:
    • PPR: Personalized PageRank
    • MPPR: Modified Personalized PageRank
    • HPR: PageRank heat similarity measure
    • RSP: Randomized Shortest Path distance
    • FE: Free Energy distance
  • Sigmoid Commute Time:
    • SCT: Sigmoid Commute Time
    • CCT: Corrected Commute Time
    • SCCT: Sigmoid Corrected Commute Time
  • SP-CT: Shortest Path, Commute Time, and their linear combination
  • Logarithmic version of every measure (a few reference formulas are sketched just after this list)
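
For reference, several of these kernels have simple closed-form definitions. Below is a minimal NumPy/SciPy sketch of the textbook formulas for a few of them; these are the standard definitions, not pygkernels' internal code, and the parameter t is illustrative:

import numpy as np
from scipy.linalg import expm, inv

def katz_kernel(A, t):
    """Katz / Walk / von Neumann diffusion kernel: K = (I - tA)^-1, 0 < t < 1/rho(A)."""
    return inv(np.eye(A.shape[0]) - t * A)

def communicability_kernel(A, t):
    """Communicability / exponential diffusion kernel: K = expm(tA), t > 0."""
    return expm(t * A)

def forest_kernel(A, t):
    """Forest / regularized Laplacian kernel: K = (I + tL)^-1 with L = D - A, t > 0."""
    L = np.diag(A.sum(axis=1)) - A
    return inv(np.eye(A.shape[0]) + t * L)

def heat_kernel(A, t):
    """Heat / Laplacian exponential diffusion kernel: K = expm(-tL), t > 0."""
    L = np.diag(A.sum(axis=1)) - A
    return expm(-t * L)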

Every measure comes in both a dissimilarity (distance) and a similarity (kernel/proximity) form. All of them can be used in any classification, clustering, or community detection algorithm that relies on the kernel trick (e.g. kernel k-means).
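
To illustrate, any third-party estimator that accepts a precomputed affinity matrix can consume such a kernel directly. A minimal sketch with a communicability kernel computed via SciPy and scikit-learn's SpectralClustering (purely illustrative; this is not part of pygkernels' API):

import networkx as nx
from scipy.linalg import expm
from sklearn.cluster import SpectralClustering

G = nx.karate_club_graph()
A = nx.adjacency_matrix(G).toarray().astype(float)

K = expm(0.1 * A)  # communicability kernel; entries are non-negative, so it is a valid affinity matrix

labels = SpectralClustering(n_clusters=2, affinity='precomputed').fit_predict(K)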

List of clustering algorithms:

  • Kernel k-means (GPU support, powered by PyTorch; see the sketch after this list)
  • Spectral clustering
  • Ward clustering
  • Wrappers for kernel k-means from kernlab and for k-means from scikit-learn
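
To make the kernel trick concrete, here is a plain NumPy sketch of Lloyd-style kernel k-means, where distances to cluster centroids are computed from the kernel matrix alone. This is a conceptual illustration only, not pygkernels' PyTorch implementation:

import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=100, seed=0):
    """Lloyd-style kernel k-means on a precomputed n x n kernel matrix K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)
    for _ in range(n_iter):
        # ||phi(i) - mu_c||^2 = K_ii - 2/|c| * sum_{j in c} K_ij + 1/|c|^2 * sum_{j,l in c} K_jl
        dist = np.zeros((n, n_clusters))
        for c in range(n_clusters):
            mask = labels == c
            size = max(mask.sum(), 1)  # empty clusters handled naively for brevity
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / size
                          + K[np.ix_(mask, mask)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

In pygkernels, KKMeans plays this role with an optional GPU backend; the sketch above only shows the underlying math.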

Graph generators:

  • Stochastic Block Model
  • LFR (networkx wrapper)
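
A short end-to-end sketch with the SBM generator (it assumes, as the grid-search example below suggests, that generate_graphs returns a list of (adjacency, labels) pairs):

from sklearn.metrics import adjusted_rand_score
from pygkernels.cluster import KKMeans
from pygkernels.data import StochasticBlockModel
from pygkernels.measure import logComm_H

graphs, _ = StochasticBlockModel(100, 2, p_in=0.3, p_out=0.1).generate_graphs(5)
A, y_true = graphs[0]  # assumption: each generated graph is an (adjacency, labels) pair

K = logComm_H(A).get_K(param=0.1)
y_pred = KKMeans(n_clusters=2).predict(K, A=A)
print(adjusted_rand_score(y_true, y_pred))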

Graph datasets:

https://github.com/vlivashkin/community-graphs

Install

pip install git+https://github.com/vlivashkin/pygkernels.git

Usage

Simple clustering:

import networkx as nx
from pygkernels.cluster import KKMeans
from pygkernels.measure import logComm_H

G = nx.read_gml('news_2cl1.gml')       # load a graph
A = nx.adjacency_matrix(G).toarray()   # dense adjacency matrix

estimator = KKMeans(n_clusters=2)      # kernel k-means with 2 clusters
K = logComm_H(A).get_K(param=0.1)      # logarithmic Communicability kernel, parameter 0.1
partition = estimator.predict(K, A=A)  # cluster label for every node
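
For a quick visual sanity check, nodes can be colored by the predicted partition (requires matplotlib; node order in nx.draw matches the order used for the adjacency matrix above):

import matplotlib.pyplot as plt

nx.draw_spring(G, node_color=partition, node_size=50)
plt.show()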

Grid search over measure parameters:

import numpy as np
from sklearn.metrics import adjusted_rand_score
from pygkernels.cluster import KKMeans
from pygkernels.data import StochasticBlockModel
from pygkernels.measure import logComm_H
from pygkernels.scenario import ParallelByGraphs

n_graphs, n, k, p_in, p_out = 100, 100, 2, 0.3, 0.1  # params for SBM graph generator
n_params = 30  # grid search through n_params in [0, 1] space

graphs, _ = StochasticBlockModel(n, k, p_in=p_in, p_out=p_out).generate_graphs(n_graphs)
gridsearch = ParallelByGraphs(adjusted_rand_score, n_params, progressbar=True, ignore_errors=True)

params, ari, error = gridsearch.perform(KKMeans, logComm_H, graphs, k, n_jobs=-1, n_gpu=2)  # try n_params parameter values on every graph
best_param = params[np.argmax(ari)]  # parameter value with the highest ARI
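
After the search you can refit a single graph at the best parameter found (again assuming each generated graph is an (adjacency, labels) pair):

A, y_true = graphs[0]
K = logComm_H(A).get_K(param=best_param)
y_pred = KKMeans(n_clusters=k).predict(K, A=A)
print(adjusted_rand_score(y_true, y_pred))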

Generate LFR graphs resembling real datasets:

from pygkernels.data import Datasets, LFRGenerator

(A, partition), info = Datasets()['news_2cl1_0.1']                  # load a real dataset
gen = LFRGenerator.params_from_adj_matrix(A, partition, info['k'])  # fit LFR parameters to it
print(gen.generate_info())
A, partition = gen.generate_graph()                                 # sample a synthetic LFR graph
gen = LFRGenerator.params_from_adj_matrix(A, partition, info['k'])  # re-fit on the synthetic graph to compare
print(gen.generate_info())

All examples are located in ./examples. More advanced usage can be found in a separate repo: https://github.com/vlivashkin/pygkernels-experiments

Citation

@inproceedings{ivashkin2021dissecting,
  title={Dissecting graph measure performance for node clustering in LFR parameter space},
  author={Ivashkin, Vladimir and Chebotarev, Pavel},
  booktitle={International Conference on Complex Networks and Their Applications},
  pages={328--341},
  year={2021},
  organization={Springer}
}