clustering with average linkage #73

Open
bobermayer opened this issue Jun 20, 2022 · 1 comment


Hi,

I've come across the preprint on hierarchical clustering (https://arxiv.org/abs/2106.05610), and this method seems to be exactly what I need.
I managed to install the gbbs library using bazel and to run HierarchicalAgglomerativeClustering through the Python bindings, but only for single and complete linkage.
From HAC_api.h in benchmarks/Clustering/SeqHAC, it looks like these are the only linkages exported to the API, but an earlier commit (1ecf43c) used to have the other methods in there. I could not get the library to work on that commit, though, because graph input and output changed and HierarchicalAgglomerativeClustering is not available as a method.
I also did not manage to add the other linkage options to HAC_api.h, apparently because the call signatures changed somewhat in between.

Is there a way to make average linkage clustering available via the Python bindings, or alternatively from the command line? I did not understand how to run the clustering from the command line.

Any help would be greatly appreciated!

Thanks!
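
For cross-checking any average-linkage output on small inputs, a naive reference implementation can help. The following is a minimal sketch, not the GBBS algorithm: it assumes a dense, symmetric dissimilarity matrix (GBBS works on sparse graphs), the function name `average_linkage` is hypothetical, and the output follows the scipy-style convention that the i-th merge creates cluster id n + i.

```python
def average_linkage(dist):
    """Naive average linkage (UPGMA) on a dense dissimilarity matrix.

    dist is a symmetric list of lists. Returns one tuple per merge:
    (cluster_a, cluster_b, height, size_of_new_cluster).
    """
    n = len(dist)
    # Active clusters (id -> size) and their pairwise dissimilarities.
    size = {i: 1 for i in range(n)}
    d = {(i, j): float(dist[i][j]) for i in range(n) for j in range(i + 1, n)}
    merges = []
    for step in range(n - 1):
        # Closest pair of active clusters; ties broken by smallest ids.
        (a, b), height = min(d.items(), key=lambda kv: (kv[1], kv[0]))
        new = n + step
        new_size = size[a] + size[b]
        # Average-linkage update: size-weighted mean of the old distances.
        for c in size:
            if c in (a, b):
                continue
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            d[(c, new)] = (size[a] * dac + size[b] * dbc) / new_size
        del d[(a, b)]
        del size[a], size[b]
        size[new] = new_size
        merges.append((a, b, height, new_size))
    return merges


# Three points on a line at positions 0, 1 and 3.
print(average_linkage([[0, 1, 3], [1, 0, 2], [3, 2, 0]]))
# [(0, 1, 1.0, 2), (2, 3, 2.5, 3)]
```

The sketch runs in cubic time, so it is only meant as a sanity check on toy inputs, not as a replacement for the library.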


bobermayer commented Jul 5, 2022

OK, so for future reference: I forked the repo and adapted the CLI to accept floating-point weighted adjacency graphs, so I can now get average linkage via the CLI.
However, even for single linkage there is some disagreement between the CLI and the Python bindings that I have been unable to resolve. For a dense graph in mtx format, the Python bindings give me the same result as fastcluster.linkage:

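import os
import sys

import numpy as np
import scipy.io

# make the bazel-built Python bindings importable
sys.path.append(os.path.join(os.getcwd(), 'bazel-bin', 'pybindings'))
import gbbs

linkage = 'single'  # assumed value: the linkage being compared here

# read the graph and flatten it into (row, col, weight) edge-list rows
adj = scipy.io.mmread('graph_mtx.mtx')
nz = adj.nonzero()
m = np.vstack((nz[0], nz[1], adj.data)).T
G = gbbs.numpyFloatEdgeListToSymmetricWeightedGraph(np.ascontiguousarray(m))
L = G.HierarchicalAgglomerativeClustering(linkage, False)
# write the same graph to disk as input for the CLI
G.writeGraph('graph_gbbs.txt')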

This is basically identical to fastcluster.linkage run on the dense distance matrix (different ordering, but the same clusterings at the same distance thresholds).
However, the CLI gives a slightly different result:

./bazel-bin/benchmarks/Clustering/SeqHAC/HACDissimilarity -s -of linkage.txt -linkage single graph_gbbs.txt

I have no idea how this can happen. Any help is still appreciated!
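
The equivalence described above (different merge ordering, but the same clusterings at the same distance thresholds) can be checked mechanically by comparing cophenetic distances, i.e. the height at which each pair of leaves first ends up in the same cluster. The sketch below is pure Python and assumes both results have already been converted to scipy-style merge rows `(a, b, height, size)`, where the i-th merge creates cluster id n + i; the helper names are hypothetical, and nothing here assumes the format of the GBBS CLI output. For scipy-style linkage matrices, `scipy.cluster.hierarchy.cophenet` computes the same quantity.

```python
def cophenetic(merges, n):
    """Map each leaf pair (i, j), i < j, to the height at which they first join."""
    members = {i: [i] for i in range(n)}
    coph = {}
    for step, (a, b, height, _size) in enumerate(merges):
        a, b = int(a), int(b)
        # Every leaf of a meets every leaf of b at this merge height.
        for i in members[a]:
            for j in members[b]:
                coph[(min(i, j), max(i, j))] = height
        members[n + step] = members.pop(a) + members.pop(b)
    return coph


def same_dendrogram(merges1, merges2, n, tol=1e-9):
    """True if both merge lists join every pair of leaves at the same height."""
    c1, c2 = cophenetic(merges1, n), cophenetic(merges2, n)
    if c1.keys() != c2.keys():
        return False
    return all(abs(c1[p] - c2[p]) <= tol for p in c1)


# Four equally spaced points under single linkage: all merges tie at height 1,
# so two valid results can list the merges in different orders.
z1 = [(0, 1, 1.0, 2), (2, 4, 1.0, 3), (3, 5, 1.0, 4)]
z2 = [(2, 3, 1.0, 2), (0, 1, 1.0, 2), (4, 5, 1.0, 4)]
z3 = [(0, 1, 1.0, 2), (2, 3, 1.0, 2), (4, 5, 2.0, 4)]
print(same_dendrogram(z1, z2, 4))  # True
print(same_dendrogram(z1, z3, 4))  # False
```

If two outputs fail this check, the pairs whose heights differ point at the edges or merges where the results actually diverge, which is more informative than a line-by-line diff of the merge lists.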
