clustering with average linkage
bobermayer opened this issue · 1 comments
Hi,
I've come across the preprint on hierarchical clustering (https://arxiv.org/abs/2106.05610), and this method seems to be exactly what I need.
I managed to install the gbbs library using bazel and also to run HierarchicalAgglomerativeClustering using the python bindings, but only for single and complete linkage.
From HAC_api.h in benchmarks/Clustering/SeqHAC it looks like these are the only ones exported to the API, but an earlier commit (1ecf43c) used to have the other methods in there. I could not get the library to work on that commit, though, because graph input and output changed and HierarchicalAgglomerativeClustering is not available as a method.
I also did not manage to include the other linkage options in HAC_api.h, apparently because the call signatures changed somewhat in between.
Is there a way to make average linkage clustering available via the python bindings, or alternatively from the command line? I did not understand how to run the clustering from the command line.
Any help would be greatly appreciated!
Thanks!
ok, so for future reference: I forked the repo and adapted the CLI to accept adjacency graphs with floating-point weights. With that change I can get average linkage via the CLI.
However, even for single linkage there's some disagreement between the CLI and the python bindings that I've been unable to resolve. For a dense graph in mtx format I'm getting the same result with the python bindings as with fastcluster.linkage:
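import os
import sys

import numpy as np
import scipy.io
# make the locally built python bindings importable
sys.path.append(os.path.join(os.getcwd(),'bazel-bin','pybindings'))
import gbbs
linkage='single'  # not defined in the original snippet; 'single' is assumed from the text
adj=scipy.io.mmread('graph_mtx.mtx')
nz=adj.nonzero()
# one row per edge: (source, target, weight)
m=np.vstack((nz[0],nz[1],adj.data)).T
G=gbbs.numpyFloatEdgeListToSymmetricWeightedGraph(np.ascontiguousarray(m))
L=G.HierarchicalAgglomerativeClustering(linkage,False)
# write the graph so the CLI can be run on the same input
G.writeGraph('graph_gbbs.txt')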
This is basically identical to fastcluster.linkage with the dense distance matrix (different ordering, but the same clusterings at the same distance thresholds).
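For reference, one way to check "same clusterings at the same distance thresholds" independent of merge ordering is to compare cophenetic distances. The sketch below is only an illustration: it assumes both results have already been converted to SciPy-style linkage matrices (the conversion from the gbbs output is not shown), and the helper name same_dendrogram and the toy data are made up for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist


def same_dendrogram(Z1, Z2, tol=1e-9):
    """Return True if two SciPy-style linkage matrices give the same
    cophenetic distances, i.e. the same clusters at every threshold,
    regardless of the order in which the merges are listed."""
    return bool(np.allclose(cophenet(Z1), cophenet(Z2), atol=tol))


# Toy data standing in for the real graph (hypothetical, illustration only).
rng = np.random.default_rng(0)
points = rng.random((20, 3))
D = pdist(points)  # condensed pairwise distance matrix

Z_single = linkage(D, method="single")
Z_single_again = linkage(D, method="single")
Z_complete = linkage(D, method="complete")

print(same_dendrogram(Z_single, Z_single_again))  # same method, same input
print(same_dendrogram(Z_single, Z_complete))      # different linkage method
```

Comparing cophenetic distances rather than the raw merge lists avoids false mismatches caused by tie-breaking or by a different numbering of the internal nodes.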
However, the CLI gives a slightly different result:
./bazel-bin/benchmarks/Clustering/SeqHAC/HACDissimilarity -s -of linkage.txt -linkage single graph_gbbs.txt
I have no idea how this can happen. Any help is still appreciated!