An implementation of a Normalized Mutual Information (NMI) measure for sets of overlapping clusters.
ATTENTION: this measure is not suitable for evaluating multi-resolution or highly overlapping clusterings because of the "best match" approximation it uses (see formula 3 in the paper). For such cases GenConvNMI can be applied instead; it is slower but suitable for all cases.
The paper: "Normalized Mutual Information to evaluate overlapping community finding algorithms" by Aaron F. McDaid, Derek Greene, Neil Hurley
This method is based on the method described in Appendix B at the end of:
"Detecting the overlapping and hierarchical community structure in complex networks"
by Andrea Lancichinetti, Santo Fortunato and János Kertész
Author: Aaron F. McDaid aaronmcdaid@gmail.com
This is a fork of the original onmi, extended for use in the PyCABeM clustering benchmark. The extensions and bug fixes include: modified I/O, node base synchronization, the added NMI_sqrt measure, fixed normalization border cases (the case of multiple fully overlapping clusters), etc.
Changes made by Artem Lutov artem@exascale.info
There are no dependencies for execution or compilation.
However, to extend the input options and automatically regenerate the input parsing, the gengetopt application must be installed:
$ sudo apt-get install gengetopt
To build, just execute:
$ make
To update or extend the input parameters, modify args.ggo and run GenerateArgparser.sh (which calls gengetopt).
$ onmi clsfile1 clsfile2
Applicability Note: OvpNMI is extremely fast but is not suitable for evaluating multi-resolution clusterings; see GenConvNMI instead.
Execution Options:
$ ./onmi -h
onmi 0.3
Compare sets of clusters by their members (nodes) using various measures (NMI,
Omega) and considering overlaps
Usage: onmi [OPTIONS] clsfile1 clsfile2
  -h, --help              Print help and exit
  -V, --version           Print version and exit
  -s, --sync              synchronize the node base, for example to fairly
                            evaluate against the top K selected clusters that
                            are subset of the original nodes  (default=off)
  -a, --allnmis           output all NMIs (sqrt and sum-denominators, LFK besides the
                            max-denominator)  (default=off)
  -m, --membership=FLOAT  average expected membership of nodes in the clusters,
                            > 0, typically >= 1  (default=`1')
  -o, --omega             print the Omega measure (can be slow)  (default=off)
  -t, --textid            use text ids of nodes instead of .cnl format
                            (default=off)
  -v, --verbose           detailed debugging  (default=off)
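The -a option reports the NMI under several normalizations. As a rough illustration of how the max, sqrt and sum denominators differ, here is a minimal Python sketch. It assumes the standard definitions (mutual information divided by the maximum, the geometric mean, or the arithmetic mean of the two entropies) and covers only the final normalization step, not how onmi computes the mutual information and entropies for overlapping clusters. The function name `nmi_variants` and the handling of a zero denominator are assumptions made for this example and are not taken from the onmi sources; the LFK variant is not covered.

```python
import math


def nmi_variants(mutual_info, h_x, h_y):
    """Normalize a mutual information value by three common denominators.

    mutual_info: I(X:Y); h_x, h_y: entropies H(X) and H(Y), all in the same units.
    Hypothetical helper for illustration only, not part of onmi.
    """
    denominators = {
        "max": max(h_x, h_y),          # max-denominator (the default measure)
        "sqrt": math.sqrt(h_x * h_y),  # geometric mean of the entropies
        "sum": (h_x + h_y) / 2,        # arithmetic mean of the entropies
    }
    # Assumption: return 0.0 when a denominator is zero to avoid division by zero
    return {name: (mutual_info / d if d > 0 else 0.0)
            for name, d in denominators.items()}


if __name__ == "__main__":
    print(nmi_variants(1.0, 1.0, 4.0))
```

Since max(H(X), H(Y)) is never smaller than either mean of the two entropies, the max-denominator value is the most conservative of the three.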
The input files contain lists of clusters (communities, modules). A typical use case is to have the "true" communities in one file and those found by your algorithm in the other file.
The default input file format is CNL (cluster nodes list), where each cluster is represented by one line. The nodes are separated by whitespace, and any non-whitespace characters may be used in the node names. Empty lines and comments (lines starting with #) are skipped. Example of the CNL format:
# The comments start with '#' like this line
# Each non-commented line is a module (cluster, community) consisting of the member nodes separated by space / tab
1
1 2
2
A node id is an unsigned integer by default; it can be any word not starting with the comment symbol # if the -t option is specified to use text ids.
- Any line starting with # is omitted as a comment, as is any remaining part of a line starting with # in the textid mode.
- Ids can't contain the : symbol, because it is used to specify the membership share in the CNL format, which is not supported by onmi. The part of an id starting from the : symbol is omitted (trimmed).
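The parsing rules above can be sketched in a few lines of Python. This is only an illustration of the CNL format as described here, not the parser used by onmi; the function name `parse_cnl` and its `textid` parameter are hypothetical, and treating a whitespace-separated token that begins with # as the start of a trailing comment is an assumption about the textid mode.

```python
def parse_cnl(text, textid=False):
    """Parse CNL text into a list of clusters, each a list of node ids.

    Hypothetical helper for illustration; ids are ints unless textid is True.
    """
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        # Skip empty lines and whole-line comments
        if not line or line.startswith("#"):
            continue
        nodes = []
        for token in line.split():
            # Assumption: in textid mode a token starting with '#' begins
            # a trailing comment, so the rest of the line is ignored
            if textid and token.startswith("#"):
                break
            # Trim the membership share, e.g. "5:0.25" -> "5"
            token = token.split(":", 1)[0]
            if not token:
                continue
            nodes.append(token if textid else int(token))
        if nodes:
            clusters.append(nodes)
    return clusters


if __name__ == "__main__":
    sample = "# The comments start with '#'\n1\n1 2\n2\n"
    print(parse_cnl(sample))
```

Applied to the example file above, the sketch yields three clusters: one containing node 1, one containing nodes 1 and 2, and one containing node 2.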
Note: Please, star this project if you use it.
- xmeasures - Extrinsic clustering measures evaluation for the multi-resolution clustering with overlaps (covers): F1_gm for overlapping multi-resolution clusterings with possible unequal node base and standard NMI for non-overlapping clustering on a single resolution.
- GenConvNMI - Overlapping NMI evaluation that is (unlike onmi) compatible with the original NMI and suitable for both overlapping and multi-resolution (hierarchical) clusterings.
- resmerge - Resolution levels clustering merger with filtering. Flattens hierarchy/list of multiple resolutions levels (clusterings) into the single flat clustering with clusters on various resolution levels synchronizing the node base.
- ExecTime - A lightweight resource consumption profiler.
- PyCABeM - Python Benchmarking Framework for the Clustering Algorithms Evaluation. Uses extrinsic (NMIs) and intrinsic (Q) measures for the clusters quality evaluation considering overlaps (nodes membership by multiple clusters).
- TInfES - Type inference evaluation scripts and accessory apps used for the benchmarking.