Merges multiple clustering resolutions / hierarchy levels into a single flat collection, i.e. flattens the input hierarchy / resolutions specified by files and / or directories. It can also be used to clean up a single clustering (collection of clusters) by deduplicating it and optionally filtering out clusters and nodes by the specified criteria.
Only the unique clusters, compared independently of node order, are saved to the output file, with optional filtering by cluster size and node base synchronization. The order of nodes in the input clusters is retained. The execution is extremely fast, O(N).
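The order-independent deduplication described above can be illustrated with a minimal Python sketch. It is not the utility's actual C++ implementation: the function name `merge_unique` and the use of `frozenset` as the cluster identity are assumptions made purely for illustration. With hash-based lookups, the sketch runs in expected time proportional to the total number of cluster members.

```python
def merge_unique(levels):
    """Flatten several levels into one list of unique clusters.

    Hypothetical helper for illustration only; the real utility is written
    in C++ and its internal identity scheme is not described here.
    """
    seen = set()    # order-independent identities of the clusters kept so far
    merged = []
    for clusters in levels:
        for cluster in clusters:
            key = frozenset(cluster)      # same nodes in any order -> same key
            if key in seen:
                continue                  # duplicate of an earlier cluster
            seen.add(key)
            merged.append(list(cluster))  # node order of the input is retained
    return merged


levels = [
    [[1, 2, 3], [4, 5]],     # first resolution level
    [[3, 1, 2], [4, 5, 6]],  # second level: [3, 1, 2] repeats [1, 2, 3]
]
print(merge_unique(levels))  # [[1, 2, 3], [4, 5], [4, 5, 6]]
```

The first occurrence of each cluster wins, so its node order is the one that appears in the result.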
resmerge is one of the utilities designed for the PyCaBeM clustering benchmark.
Author (c) Artem Lutov artem@exascale.info
There are no requirements for execution or compilation other than the standard C++ library.
However, to extend the input options and automatically regenerate the input parsing, the gengetopt application should be installed: $ sudo apt-get install gengetopt.
For the prebuilt executable on Linux Ubuntu 16.04 x64: $ sudo apt-get install libstdc++6.
To build the utility, just execute $ make.
To update/extend the input parameters, modify args.ggo and run GenerateArgparser.sh (calls gengetopt).
Build errors might occur if the default g++/gcc is older than 5.x. In that case g++-5 should be installed, and the Makefile might need to be edited, replacing g++, gcc with g++-5, gcc-5.
Execution Options:
$ ./resmerge -h
resmerge 1.2

Merge multiple clusterings (resolution/hierarchy levels) outputting only the
unique clusters with the optional their filtering by the size and nodes
filtering by the specified base.

Usage: resmerge [OPTIONS] clusterings...

  clusterings...  - clusterings specified by the given files and directories
  (non-recursive traversing)

  -h, --help              Print help and exit
  -V, --version           Print version and exit
  -o, --output=STRING     output file name. If a single directory <dirname> is
                            specified then the default output file name is
                            <dirname>.cnl.
                            NOTE: the number of nodes is written to the output
                            file only if the node base synchronization is
                            applied, otherwise 0 is set
                            (default=`clusters.cnl')
  -r, --rewrite           rewrite already existing resulting file or skip the
                            processing  (default=off)
  -b, --btm-size=LONG     bottom margin of the cluster size to process
                            (default=`0')
  -t, --top-size=LONG     top margin of the cluster size to process
                            (default=`0')
  -m, --membership=FLOAT  average expected membership of the nodes in the
                            clusters, > 0, typically >= 1  (default=`1')

 Mode: sync
  Synchronize the node base of the merged clustering
  -s, --sync-base=STRING  synchronize node base with the specified collection

 Mode: extract
  Extract the node base from the specified clustering(s)
  -e, --extract-base      extract the node base from the clusterings instead of
                            merging the clusterings  (default=off)
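To make the effect of the size margins and the node base synchronization more concrete, here is a rough Python sketch. The help text does not spell out the exact semantics, so the following are assumptions of this sketch and not statements about the utility: a margin of 0 means "no limit", the margins are inclusive, the synchronization is applied before the size check, and clusters left empty by the synchronization are dropped. The function name `filter_clusters` is hypothetical.

```python
def filter_clusters(clusters, btm_size=0, top_size=0, node_base=None):
    """Apply optional node base synchronization and size filtering.

    Illustration only. Assumed semantics (not confirmed by the help text):
    0 disables a margin, margins are inclusive, sync precedes size checks.
    """
    result = []
    for cluster in clusters:
        if node_base is not None:
            # Keep only the nodes that are present in the node base
            cluster = [node for node in cluster if node in node_base]
        size = len(cluster)
        if size == 0:
            continue  # nothing left after the synchronization
        if btm_size and size < btm_size:
            continue  # smaller than the bottom margin
        if top_size and size > top_size:
            continue  # larger than the top margin
        result.append(cluster)
    return result


clusters = [[1, 2, 3], [4], [5, 6, 7, 8]]
base = {1, 2, 3, 4, 5, 6}  # hypothetical node base

print(filter_clusters(clusters, btm_size=2, top_size=3))
# [[1, 2, 3]]
print(filter_clusters(clusters, btm_size=2, top_size=3, node_base=base))
# [[1, 2, 3], [5, 6]]
```

In the second call the cluster [5, 6, 7, 8] is first reduced to [5, 6] by the node base and therefore passes the size check.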
Examples
Merge clusterings (resolution levels) from <dirname> to <dirname>.cnl:
$ ./resmerge /opt/tests/tmp/resolutions
Deduplicate a single clustering:
$ ./resmerge communs/com-dblp.all.cmty.txt -o communs/com-dblp.all.cmty.dedub.cnl
Extract the node base to <filename>_base.cnl:
$ ./resmerge -e /opt/tests/collection.cnl
Merge clusterings, synchronize the node base and output the resulting flattened hierarchy/levels to the specified file:
$ ./resmerge -s /opt/tests/levels_nodebase.cnl -o /opt/tests/flatlevs_synced.cnl /opt/tests/levels/ /opt/tests/level_extra.cnl
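The node base extraction used in the examples above (the -e option) amounts to collecting the distinct nodes that occur in the clusterings. The Python sketch below shows the idea under a simplifying assumption about the input format that this document does not confirm: each non-empty line lists the whitespace-separated node ids of one cluster, and lines starting with `#` are skipped as comments. The function names are hypothetical.

```python
def parse_clusters(text):
    """Parse clusters from text, one cluster per line (assumed format)."""
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and assumed comment/header lines
        clusters.append(line.split())
    return clusters


def extract_node_base(clusters):
    """Return the distinct nodes in the order of their first occurrence."""
    # dict.fromkeys keeps insertion order and drops repeated keys
    return list(dict.fromkeys(node for cluster in clusters for node in cluster))


sample = """# hypothetical header line
1 2 3
3 4

2 5
"""
print(extract_node_base(parse_clusters(sample)))  # ['1', '2', '3', '4', '5']
```

Node ids are kept as strings here to avoid assuming anything about their numeric range.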
- Clubmark - A parallel isolation framework for benchmarking and profiling clustering (community detection) algorithms considering overlaps (covers).
- xmeasures - Extrinsic quality (accuracy) measures evaluation for the overlapping clustering on large datasets: family of mean F1-Score (including clusters labeling), Omega Index (fuzzy version of the Adjusted Rand Index) and standard NMI (for non-overlapping clusters).
- GenConvNMI - Overlapping NMI evaluation that is compatible with the original NMI and suitable for both overlapping and multi resolution (hierarchical) clustering evaluation.
- ExecTime - A lightweight resource consumption profiler.
Note: Please, star this project if you use it.