ParAlg/gbbs

clustering with average linkage

bobermayer opened this issue · 1 comments

Hi,

I've come across the preprint on hierarchical clustering (https://arxiv.org/abs/2106.05610), and this method seems to be exactly what I need.
I managed to install the gbbs library using bazel and also to run HierarchicalAgglomerativeClustering using the python bindings, but only for single and complete linkage.
From HAC_api.h in benchmarks/Clustering/SeqHAC it looks like these are the only linkages exported to the API, but an earlier commit (1ecf43c) still had the other methods in there. I could not get the library to work on that commit, though, because graph input and output have changed since then and HierarchicalAgglomerativeClustering is not available as a method there.
I also did not manage to add the other linkage options to the current HAC_api.h, apparently because the call signatures changed somewhat in between.

Is there a way to make average linkage clustering available via the python bindings, or alternatively from the command line? I did not understand how to run the clustering from the command line.

any help would be greatly appreciated!

Thanks!

ok, so for future reference: I forked the repo and adapted the CLI to accept floating-point weighted adjacency graphs. with that change I can get average linkage via the CLI.
however, even for single linkage there's some disagreement between the CLI and the python bindings that I've been unable to resolve. for a dense graph in mtx format, the python bindings give the same result as fastcluster.linkage:

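import os
import sys
import numpy as np
import scipy.io
# make the bazel-built python bindings importable
sys.path.append(os.path.join(os.getcwd(), 'bazel-bin', 'pybindings'))
import gbbs

linkage = 'single'  # assumed value, matching -linkage single in the CLI call below
# read the graph and build an (m, 3) array of (source, target, weight) rows
adj = scipy.io.mmread('graph_mtx.mtx')
nz = adj.nonzero()
m = np.vstack((nz[0], nz[1], adj.data)).T
G = gbbs.numpyFloatEdgeListToSymmetricWeightedGraph(np.ascontiguousarray(m))
# second argument False: presumably weights are dissimilarities, not similarities
L = G.HierarchicalAgglomerativeClustering(linkage, False)
# write the same graph to disk as input for the CLI
G.writeGraph('graph_gbbs.txt')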

the result is basically identical to fastcluster.linkage on the dense distance matrix (the merges are ordered differently, but I get the same clusterings at the same distance thresholds).
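The "same clusterings at the same distance thresholds" check can be sketched roughly as below. This is only a minimal illustration, not part of gbbs or fastcluster: it assumes each dendrogram is available as a list of (a, b, height) merges with SciPy-style ids (ids below n are points, id n + i is the cluster created by merge i), and that the linkage is monotone, which holds for single, complete and average linkage. The helper names `flat_clusters` and `first_disagreement` are made up for this example, and the actual gbbs output may need converting into this form first.

```python
from collections import defaultdict


def flat_clusters(n, merges, threshold):
    """Partition of points 0..n-1 after applying every merge with height <= threshold.

    Assumes SciPy-style ids and that each merge only refers to earlier merges.
    """
    parent = list(range(n))

    def find(x):
        # union-find lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    rep = {}  # cluster id -> one original point contained in that cluster
    for i, (a, b, height) in enumerate(merges):
        a, b = int(a), int(b)
        ra = a if a < n else rep[a]
        rb = b if b < n else rep[b]
        rep[n + i] = ra
        if height <= threshold:
            parent[find(ra)] = find(rb)

    groups = defaultdict(set)
    for p in range(n):
        groups[find(p)].add(p)
    return {frozenset(g) for g in groups.values()}


def first_disagreement(n, merges_a, merges_b, tol=1e-9):
    """Smallest merge height at which the two flat clusterings differ, or None."""
    heights = sorted({h for _, _, h in merges_a} | {h for _, _, h in merges_b})
    for h in heights:
        if flat_clusters(n, merges_a, h + tol) != flat_clusters(n, merges_b, h + tol):
            return h
    return None


# toy dendrograms on 4 points: b lists the same merges as a in a different order,
# c attaches point 2 to {0, 1} instead of to point 3
merges_a = [(0, 1, 1.0), (2, 3, 2.0), (4, 5, 3.0)]
merges_b = [(3, 2, 2.0), (1, 0, 1.0), (5, 4, 3.0)]
merges_c = [(0, 1, 1.0), (4, 2, 2.0), (5, 3, 3.0)]

print(first_disagreement(4, merges_a, merges_b))
print(first_disagreement(4, merges_a, merges_c))
```

Comparing partitions rather than raw merge lists ignores differences in merge order and cluster numbering, so only real differences in the clustering show up.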
however, the CLI gives a slightly different result when run on the graph written above:

./bazel-bin/benchmarks/Clustering/SeqHAC/HACDissimilarity -s -of linkage.txt -linkage single graph_gbbs.txt

I have no idea how this can happen. any help still appreciated!