/CommunityFitNet

This page is a companion for our paper on overfitting and underfitting of community detection methods on real networks, written by Amir Ghasemian, Homa Hosseinmardi, and Aaron Clauset. (arXiv:1802.10582)

CommunityFitNet


This page is a companion for the paper

Amir Ghasemian, Homa Hosseinmardi, and Aaron Clauset
Evaluating Overfit and Underfit in Models of Network Community Structure, IEEE TKDE 32(9), 1722-1735 (2019).

on evaluating the overfitting and underfitting of community detection methods on real networks.

CommunityFitNet provides both (i) a reference set of networks and (ii) partitions of those networks for a large set of state-of-the-art community detection algorithms (Table 1 of the paper).

The purpose of this package is to facilitate between-algorithm comparisons on a large and realistic corpus of network data sets drawn from a variety of domains and of a variety of sizes. The qualitative behavior of new community detection algorithms can be assessed by comparing their partitions to those in the reference set. To compare a new algorithm with those in our evaluation set of algorithms, a researcher can run the new algorithm on the proposed benchmark, and identify which reference algorithm has the most similar behavior, e.g., in the average number of communities found (Fig. 2b of the paper). We believe the availability of this benchmark and the results of running so many state-of-the-art algorithms on it should facilitate further advances in developing community detection algorithms.


Fig. 2b of the paper

Overfitting and Underfitting among different clustering algorithms

General algorithms like MDL, Bayesian methods and regularized-likelihood algorithms tend to perform very well under different settings and can be used as reference methods for comparing with new methods. Additionally, popular methods like Infomap and modularity tend to over-fit in practice and are thus not generally reliable, at least under link prediction (Fig. 4 of the paper). However, when these more specialized methods are paired with their preferred inputs, they tend to perform much better (Fig. 6 of the paper). Generally community detection algorithms can be categorized into two general settings of probabilistic and heuristic methods. This dichotomy can be seen in the hierarchical clustering of 16 state-of-the-art community detection algorithms (Fig. 3 of the paper).


Fig. 4 of the paper


Fig. 6 of the paper


Fig. 3 of the paper

Reference:

To appear, IEEE Trans. Knowledge and Data Engineering (TKDE) (2019),
Evaluating and Comparing Overfit in Models of Network Community Structure
Amir Ghasemian, Homa Hosseinmardi, and Aaron Clauset
( arXiv version )

Clarification about the data:

We found an indexing issue in some of the bipartite networks (all “Norwegian_Board_of_Directors_net2mode…” (111 networks — network ids = 254–364) and one network called “Aishihik_Lake_host-parasite_web_Aishihik_Lake_host-parasite_web” (1 network - network id = 0) have this indexing issue). Therefore, we provide two versions of the data. One for replication purposes only, and one for reusing the data by other researchers.

Download the corrected package:

Download Pickle Format.

This package contains the corpus of 572 real-world networks (the corrected networks are replaced with the networks in the paper) from many scientific domains drawn from the Index of Complex Networks (ICON). This corpus spans a variety of sizes and structures, with 22% social, 21% economic, 34% biological, 12% technological, 4% information, and 7% transportation graphs (Fig. 1 of the paper). In addition to the information about each network, we provide the partitions achieved by our set of chosen algorithms in our paper (except than for corrected networks) for further study and comparisons by other researchers in the field.

Download the original package (for replication purposes only):

Download Pickle Format.

Note: This data is for replication purposes only. If you want to use this data for your project the corrected version is provided above.

This package contains the corpus of 572 real-world networks (original networks have been used in our paper) from many scientific domains drawn from the Index of Complex Networks (ICON). This corpus spans a variety of sizes and structures, with 22% social, 21% economic, 34% biological, 12% technological, 4% information, and 7% transportation graphs (Fig. 1 of the paper). In addition to the information about each network, we provide the partitions achieved by our set of chosen algorithms in our paper for further study and comparisons by other researchers in the field.

Instruction for using the package:

To load the data:

import pickle  
# load the data 
infile = open('./CommunityFitNet.pickle','rb')  
df = pickle.load(infile)  

# read edge lists for all networks
df_edgelists = df['edges_id'] # column 'edges_id' in dataframe df includes the edge list 
                              # for each network 
 
# extract the edge list for the first network 
edges_orig = df_edgelists.iloc[0] # a numpy array of edge list for original graph 


Fig. 1 of the paper


Table 1 of the paper

Source of the codes and the time complexity of the algorithms:

In the following table the time Complexity and the link of the code for community detection methods have been used in the paper are provided. In the time complexity column, $N$ is the number of nodes, $E$ is the number of edges, $k$ is the number of communities, $k_{\max}$ is the maximum number of clusters considered in model selection, $T$ is the number of iterations for convergence, and $\tau$ is the mixing time of the Markov chain in corresponding algorithm. The maximum number of clusters considered in model selection, $k_{\max}$, is $O(\sqrt{N})$. Therefore, the time complexity of LRT-WB can be simplified as $O(N^{3/2} k^2)$.

Note: Here, the time complexities are provided for the algorithms and the time complexity of the codes are possibly different than the ones in the table.


Table of time complexity and the link of the codes have been used for community detection algorithms

References in the table:

[1] P.-Y. Chen. Analysis and actions on graph data. 2016.

[2] K. Hayashi, T. Konishi, and T. Kawamoto. A tractable fully Bayesian method for the stochastic block model. arXiv:1602.02256, 2016.

[3] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborova, and P. Zhang. Spectral redemption in clustering sparse networks. Proc. Natl. Acad. Sci., 110(52):20935– 20940, 2013.

[4] C. M. Le and E. Levina. Estimating the number of communities in networks by spectral methods. arXiv:1507.00827, 2015.

[5] A. Mukherjee, M. Choudhury, F. Peruani, N. Ganguly, and B. Mitra. Dynamics On and Of Complex Networks, Volume 2: Applications to Time-Varying Dynamical Systems. Springer Science & Business Media, 2013.

[6] T. P. Peixoto. Parsimonious module inference in large networks. Phys. Rev. Lett., 110(14):148701, 2013.

[7] A. Saade, F. Krzakala, and L. Zdeborova. Spectral clustering of graphs with the Bethe Hessian. In Adv. Neural Info. Proc. Sys., pages 406–414, 2014.

[8] Y. R. Wang, P. J. Bickel, et al. Likelihood-based model selection for stochastic block models. Ann. Stat., 45(2):500–528, 2017.

How to cite this work:

If you use this data in your research, please cite it as follows:

@article{ghasemian2019evaluating,
  title = {Evaluating overfit and underfit in models of network community structure},
  author = {Ghasemian, Amir and Hosseinmardi, Homa and Clauset, Aaron},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  volume = {32},
  number = {9},
  pages = {1722--1735},
  year = {2019},
  publisher = {IEEE},
  href = {https://journals.aps.org/prx/abstract/10.1103/PhysRevX.6.031005},
}