OvpNMI - Overlapping NMI

An implementation of a Normalized Mutual Information (NMI) measure for sets of overlapping clusters.
ATTENTION: does not suitable to evaluate multi-resolution or highly overlapping clusterings because of the used "best match" approximation (see formula 3 in the paper). To evaluate such cases GenConvNMI can be applied, which works slower, but suitable for all cases.

The paper: "Normalized Mutual Information to evaluate overlapping community finding algorithms" by Aaron F. McDaid, Derek Greene, Neil Hurley
This method is based on the method described in Appendix B at the end of: "Detecting the overlapping and hierarchical community structure in complex networks" by Andrea Lancichinetti, Santo Fortunato and János Kertész

Author: Aaron F. McDaid aaronmcdaid@gmail.com

This is a fork of the original onmi with the extension purposes to be used in the PyCaBeM clustering benchmark. The extendsion and bugfixes include: modification of the I/O, node base synchronization, NMI_sqrt added, normalization border cases fixed (the case of fully overlapping multiple clusters), etc.
Changes made by Artem Lutov artem@exascale.info

Deployment

Dependencies

There no any dependencies for the execution or compilation.
However, to extend the input options and automatically regenerate the input parsing, gengetopt application should be installed: $ sudo apt-get install gengetopt.

Compilation

Just execute $ make.
To update/extend the input parameters just modify args.ggo and run GenerateArgparser.sh (calls gengetopt).

Usage

$ onmi clsfile1 clsfile2

Applicability Note: OvpNMI is extremely fast, but does not suitable to evaluate multi-resolution clusterings, see GenConvNMI instead.

Execution Options:

$ ./onmi -h
onmi 0.3

Compare sets of clusters by their members (nodes) using various measures (NMI,
Omega) and considering overlaps

Usage: onmi [OPTIONS] clsfile1 clsfile2

  -h, --help              Print help and exit
  -V, --version           Print version and exit
  -s, --sync              synchronize the node base, for example to fairly
                            evaluate against the top K selected clusters that
                            are subset of the original nodes  (default=off)
  -a, --allnmis           output all NMIs (sqrt and sum-denominators, LFK besides the
                            max-denominator)  (default=off)
  -m, --membership=FLOAT  average expected membership of nodes in the clusters,
                            > 0, typically >= 1  (default=`1')
  -o, --omega             print the Omega measure (can be slow)  (default=off)
  -t, --textid            use text ids of nodes instead of .cnl format
                            (default=off)
  -v, --verbose           detailed debugging  (default=off)

The input files contain list of clusters (communities, modules). A typical use case is to have the "true" communities in one file and and those found by your algorithm in the other file.

The default input file format is CNL (cluster nodes list), where each cluster is represented by one line. The nodes are separated by whitespace, and any non-whitespace characters may be used in the node names. Empty lines and comments (lines starting with #) are skipped. Example of the CNL format:

# The comments start with '#' like this line
# Each non-commented line is a module(cluster, community) consisting of the the member nodes separated by space / tab
1
1 2
2

A node id is unsigned integer by default, and it can be any word not starting with the comment symbol # if -t option is specified to use text ids.

Any line starting with # is omitted as a comment, also as any remained part of the line starting with # in the textid mode
Ids can't contain : symbol, because it is used to specify the membership share in the CNL format, which is not supported by onmi. The id part starting from the : symbol is omitted (trimmed).

Note: Please, star this project if you use it.

Related Projects

xmeasures - Extrinsic clustering measures evaluation for the multi-resolution clustering with overlaps (covers): F1_gm for overlapping multi-resolution clusterings with possible unequal node base and standard NMI for non-overlapping clustering on a single resolution.
GenConvNMI - Overlapping NMI evaluation that is (unlike onmi) compatible with the original NMI and suitable for both overlapping and multi resolution (hierarchical) clusterings.
resmerge - Resolution levels clustering merger with filtering. Flattens hierarchy/list of multiple resolutions levels (clusterings) into the single flat clustering with clusters on various resolution levels synchronizing the node base.
ExecTime - A lightweight resource consumption profiler.
PyCABeM - Python Benchmarking Framework for the Clustering Algorithms Evaluation. Uses extrinsic (NMIs) and intrinsic (Q) measures for the clusters quality evaluation considering overlaps (nodes membership by multiple clusters).
TInfES - Type inference evaluation scripts and accessory apps used for the benchmarking.