Hierarchical Clustering Questions
sven-h opened this issue · 17 comments
Hi,
thank you for the really nice and helpful library,
I have a few questions about hierarchical clustering:
- I would like to transform the `PointerHierarchyRepresentationResult` into the data structure used by, e.g., the scipy `linkage` function, so that the information about the merges is easily accessible. Am I right that I first need to use the parent pointers, and in case multiple DBIDs point to the same parent, the parent distance? In the `topologicalSort` function, the merge order is also used to break ties when the distance is the same. Correct? The only possibility to access the `mergeOrder` is to place a class in the same namespace. Is this the way to go?
- I am further interested in the merge hierarchy of datasets larger than 65,535 instances. I also have functions to compute the distance matrix in parallel (and more efficiently for special distances such as Euclidean, by using the BLAS library). As far as I can see, this would require a complete rewrite of the algorithms instead of just changing the `MatrixParadigm` class. Would such a rewrite be useful for the library, or what other options do I have?
Thanks a lot
Best regards
Sven
- The pointer hierarchy comes from the (fast) SLINK algorithm, but it is nevertheless on my wishlist to switch to the scipy-style list structure, which makes some postprocessing easier.
- Such a transformation can be found in some of the clustering extraction classes, e.g., `ClustersWithNoiseExtraction`; a rough sketch of the idea is at the end of this comment.
- Except for single linkage with SLINK, almost all of these methods need to maintain a pairwise distance matrix of the clusters - initially of all points. Rewriting this to use a "ragged" matrix is not too hard - then you are limited largely by your memory and run time. A rewrite using a Java Buffer instead of an array (which is indexed with a long) could be both easier and more efficient. But these methods are simply not very scalable, and I do not think this is the best way to proceed. For special cases - in particular single link, but also Ward linkage - things can be optimized better, but this is not included in the release yet. When working with data of this size, some form of data aggregation (as done, e.g., in BIRCH/BETULA) is most likely desirable, to avoid quadratic-to-cubic runtime.
What data size are you interested in, what is the current runtime at 2^16 instances - and how much runtime are you willing to invest?
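To make the transformation above concrete, here is a minimal sketch, assuming the pointer hierarchy is given as plain parent/distance arrays (the actual ELKI classes use DBIDs and data stores, so this is illustrative only; ties are broken by index here rather than by the real merge order):

```java
import java.util.Arrays;
import java.util.Comparator;

/** Sketch: convert a SLINK-style pointer hierarchy into a scipy-style linkage table. */
public final class PointerToLinkage {
  /**
   * @param parent parent[i] = object that i points to (the root points to itself)
   * @param dist   dist[i] = distance at which i joins the cluster of parent[i]
   * @return (n-1) x 4 table: [cluster a, cluster b, distance, new size];
   *         merged clusters get ids n, n+1, ... as in scipy.
   */
  public static double[][] toLinkage(int[] parent, double[] dist) {
    final int n = parent.length;
    // Process objects by ascending merge distance (ties broken by index).
    Integer[] byDist = new Integer[n];
    for (int i = 0; i < n; i++) { byDist[i] = i; }
    Arrays.sort(byDist, Comparator.comparingDouble((Integer i) -> dist[i]));
    int[] uf = new int[n], cid = new int[n], size = new int[n]; // union-find state
    for (int i = 0; i < n; i++) { uf[i] = i; cid[i] = i; size[i] = 1; }
    double[][] linkage = new double[n - 1][4];
    int next = n, row = 0; // next free cluster id, current output row
    for (int k = 0; k < n && row < n - 1; k++) {
      int i = byDist[k];
      if (i == parent[i]) { continue; } // the root performs no merge
      int a = find(uf, i), b = find(uf, parent[i]);
      if (a == b) { continue; } // guard; should not occur in a valid hierarchy
      linkage[row][0] = cid[a];
      linkage[row][1] = cid[b];
      linkage[row][2] = dist[i];
      linkage[row][3] = size[a] + size[b];
      uf[a] = b; size[b] += size[a]; cid[b] = next++; row++;
    }
    return linkage;
  }

  private static int find(int[] uf, int x) {
    while (uf[x] != x) { x = uf[x] = uf[uf[x]]; } // path halving
    return x;
  }
}
```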
Hi,
thank you for your very fast answer.
- I will have a look at `ClustersWithNoiseExtraction` and see how far I can get.
- I quickly ran a test with 60,000 instances, and it takes 151 seconds. My target is around 150,000 instances, but I also have enough time (up to several days) and main memory (up to 1 TB). Only the Java array size (implementation-specific) is currently the limiting factor.
- Are you referring to a Java `DoubleBuffer`? (At least the interface looks like it can only handle the same amount of data as arrays.) I thought a ragged array is also a good data structure, especially for such algorithms. On the other hand, a custom buffer indexed by a long, such as the one on Stack Overflow, would work as well and require less rewriting (as you already mentioned). Do you think a ragged array or a custom buffer is the better way to go?
I thought `DoubleBuffer` would be long-indexed, but apparently I am wrong. Maybe I had the `Unsafe` class in mind, which has `getDouble(long address)`. Either way, if you are using BLAS to compute the distance matrix, it probably already originates off-heap; then using this memory directly may save you a copy. But working with ragged arrays isn't too bad here, either.
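For illustration, such a ragged triangular layout might look like this (a sketch, not existing ELKI code). No single allocation exceeds n-1 entries, so the int array-size limit is never hit; at 150,000 instances the lower triangle is about n²/2 × 8 bytes ≈ 90 GB of doubles, which fits into the 1 TB mentioned above:

```java
/** Sketch: ragged lower-triangular storage for pairwise distances d(x, y), x > y. */
final class RaggedTriangularMatrix {
  private final double[][] rows;

  RaggedTriangularMatrix(int n) {
    rows = new double[n][];
    for (int x = 1; x < n; x++) {
      rows[x] = new double[x]; // row x holds d(x, 0) .. d(x, x-1)
    }
  }

  double get(int x, int y) { // requires x > y
    return rows[x][y];
  }

  void set(int x, int y, double v) { // requires x > y
    rows[x][y] = v;
  }
}
```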
ELKI already uses fastutil in a number of places, which includes a `DoubleBigArrayBigList` (when building a bundle jar, remove the exclude lines from `addons/bundle/build.gradle`; we currently exclude these classes to reduce the jar size). These classes are usually very efficient and can be recommended; this is likely the easiest one to use instead.
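Usage would look roughly like this (a sketch based on the fastutil API; at this size the JVM of course needs a correspondingly large heap, e.g., `-Xmx` well above 90 GB):

```java
import it.unimi.dsi.fastutil.doubles.DoubleBigArrayBigList;

public class BigListDemo {
  public static void main(String[] args) {
    int n = 150_000;
    long m = (long) n * (n - 1) / 2; // ~1.1e10 entries, beyond the int range
    DoubleBigArrayBigList dist = new DoubleBigArrayBigList(m);
    dist.size(m); // grow to m entries, zero-filled
    long idx = 3_000_000_000L; // a long index, impossible with a plain double[]
    dist.set(idx, 1.5);
    System.out.println(dist.getDouble(idx)); // prints 1.5
  }
}
```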
151 seconds isn't too bad - which algorithm and which linkage did you use?
I checked the implementation of `DoubleBigArrays` as well as `DoubleBigArrayBigList`. The former just provides static methods to get/set values of a `double[][]`. The only advantage of the latter is that it can add elements on the fly, which is not needed here (at least from my perspective).
Most of the algorithms directly modify the `double[]` in `MatrixParadigm` (so it is more like a data storage). Maybe it is possible to have an interface such that the real implementation (using ragged arrays, etc.) is hidden and only the necessary methods such as getters/setters are provided (but I think this would mean a lot of rewriting in many HAC classes). I am thinking of something along the following lines.
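Purely hypothetical (no such interface exists in ELKI), the abstraction could be as small as:

```java
/**
 * Hypothetical abstraction over the triangular distance storage, so that a
 * flat double[], a ragged double[][], or a big array could be swapped in
 * behind the same accessors. Not existing ELKI API.
 */
interface TriangularDistanceStore {
  /** Read the distance at linear position i of the condensed matrix. */
  double get(long i);

  /** Write the distance at linear position i of the condensed matrix. */
  void set(long i, double v);

  /** Number of stored entries, n*(n-1)/2 for n objects. */
  long size();
}
```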
For the simple experiments I used `AnderbergHierarchicalClustering` and single linkage (the linkage is just an example and could be any other linkage strategy).
The question is how to proceed here. Maybe I am able to provide an implementation of AGNES (in the beginning) with this new kind of storage, but I am not sure whether I can change the other implementations at all (I would be interested in `AnderbergHierarchicalClustering` because it is a lot faster, due to the efficient search for the next merge).
Anderberg is certainly the better starting point than AGNES; it is surprisingly simple (that is part of why it is fast). It will not be sufficient to substitute `MatrixParadigm`, because it writes directly into the underlying matrix (to improve performance, it exploits the triangular layout and does not go through a get(x, y) getter that would repeat certain computations unnecessarily; so we eventually removed this abstraction).
But it may be sufficient to copy-and-paste "fork" these two classes. You may want to change the output format to your liking while at it, and produce such a linkage table instead of the pointer hierarchy.
I would stick to the pointer hierarchy to fulfill the requirements of your library (and just provide a transformation to the table style).
Would you accept a pull request and have a look at the proposed code, so that others can also profit from it?
Or do you think the requirement for a higher number of instances is out of scope for ELKI?
Maybe the two versions can exist side by side in ELKI?
Still, Anderberg looks more complicated (maybe also due to the triangular layout exploits) than plain AGNES.
Our AGNES does the same optimizations with respect to iterating the linearized diagonal form (which is also used in scipy, by the way), so there is no difference there.
I would certainly appreciate a pull request to add an "AnderbergBig" class.
It is definitely worth benchmarking at what cost it is possible to add back the abstraction that would allow replacing only the `MatrixParadigm` class (at which point it would be possible to dynamically switch to a "BigMatrixParadigm" depending on the data set size). It may well be that a "getLinear(long i)" accessor, with long computations for the index in regular Anderberg, comes at little enough cost, since Java will inline the getter and then optimize the bounds checks anyway. I would avoid switching to a "get(int x, int y)" getter.
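To make that concrete, a long-indexed "getLinear" over chunked backing arrays could look roughly like this (a sketch only; the chunk size and the row-major lower-triangle layout are assumptions, not the actual `MatrixParadigm` layout):

```java
/** Sketch: long-indexed access into a linearized lower-triangular matrix,
 *  backed by chunked arrays to get past the int limit of a single double[]. */
final class BigTriangularMatrix {
  private static final int CHUNK_SHIFT = 27; // 2^27 doubles = 1 GiB per chunk
  private static final int CHUNK_MASK = (1 << CHUNK_SHIFT) - 1;
  private final double[][] data;

  BigTriangularMatrix(int n) {
    long m = (long) n * (n - 1) / 2; // total number of entries
    int chunks = (int) ((m + CHUNK_MASK) >>> CHUNK_SHIFT);
    data = new double[chunks][];
    for (int c = 0; c < chunks; c++) {
      long remaining = m - ((long) c << CHUNK_SHIFT);
      data[c] = new double[(int) Math.min(remaining, 1L << CHUNK_SHIFT)];
    }
  }

  /** Linear index of (x, y) with x > y in a row-major lower triangle. */
  static long linearIndex(int x, int y) {
    return ((long) x * (x - 1)) / 2 + y; // long arithmetic avoids int overflow
  }

  double getLinear(long i) {
    return data[(int) (i >>> CHUNK_SHIFT)][(int) (i & CHUNK_MASK)];
  }

  void setLinear(long i, double v) {
    data[(int) (i >>> CHUNK_SHIFT)][(int) (i & CHUNK_MASK)] = v;
  }
}
```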
Although if I'd need to scale this to larger data myself, I'd likely implement Anderberg from scratch in Rust now; this will likely optimize even better and might be 2x-3x faster.
P.S. single-link is a special case, in which certain effects cannot occur, hence the scalability of other linkages could be worse (but probably not by much). Single-link can be implemented with O(n) memory, too, without such a matrix in the first place.
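For reference, a compact sketch of that O(n)-memory single link (this is the classic SLINK of Sibson, producing exactly the pointer representation discussed above; the distance callback interface is an assumption of this sketch, not ELKI API):

```java
/** Sketch of SLINK: single link in O(n^2) time and O(n) memory,
 *  yielding the pointer representation (pi, lambda). */
public final class SlinkSketch {
  public interface Distance {
    double of(int i, int j);
  }

  /** Fills pi (parent pointer) and lambda (merge distance) for n objects. */
  public static void slink(int n, Distance d, int[] pi, double[] lambda) {
    double[] m = new double[n]; // distances from point i to all previous points
    for (int i = 0; i < n; i++) {
      pi[i] = i;
      lambda[i] = Double.POSITIVE_INFINITY;
      for (int j = 0; j < i; j++) {
        m[j] = d.of(i, j);
      }
      for (int j = 0; j < i; j++) {
        if (lambda[j] >= m[j]) { // j now merges earlier, via the new point i
          m[pi[j]] = Math.min(m[pi[j]], lambda[j]);
          lambda[j] = m[j];
          pi[j] = i;
        } else {
          m[pi[j]] = Math.min(m[pi[j]], m[j]);
        }
      }
      for (int j = 0; j < i; j++) { // repair pointers that became inconsistent
        if (lambda[j] >= lambda[pi[j]]) {
          pi[j] = i;
        }
      }
    }
  }
}
```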
Branch https://github.com/elki-project/elki/tree/feature/newhac has a rewrite to a merge-history representation that is likely a bit easier to use. It also uses integer indexes (representing cluster numbers 0..2n-2) rather than DBIDs, as we no longer identify clusters with their last object as in the SLINK pointer hierarchy.
This is a preparation for (hopefully) an even faster HAC to come in a year or two (when researched, implemented, and published). That one hopefully will also get rid of any 65k size limit.
I will also rewrite and merge an NNChain variant that does not need a pairwise distance matrix (and hence use only linear memory), which already exists in a separate private branch.
If the above link does not work anymore, the branch has been merged into main.
That sounds cool. Thank you for the information.
I will have a look at it.
I have merged the branch into main. It was 20% faster (for AGNES, Anderberg) in a brief test.
I did not remove the 65k limit yet, but it should be easier than before.
Great, thanks a lot.
One more question: Do you also plan to release a new version to Maven Central in the near future?
I do want to make a new release because of the many new features in ELKI, but usually I want these releases to come along with a supporting publication to make them easier to cite; this will need some time and preparation.
There are some todos for the next release that I did not yet have time to tackle: use some of the newer Java language features such as "var" for readability; rewrite the logging layer to allow using, e.g., slf4j or log4j (but we have progress bars implemented via logging, which will be more difficult to support efficiently through a logging abstraction); etc.
A new release (without the logging change) has been submitted as a demo paper, and I hope to release 0.8.0 end of the summer.
ok, great. Thank you for the information.
The demo paper has been accepted, it will appear at SISAP 2022 in Bologna, October 5-7.
The ELKI 0.8.0 release will be prior to the conference.
cf. https://sisap.org/2022/papers.html
Congrats.
Good to hear that a new release will be prepared.
ELKI 0.8.0 is on Maven Central, but with a new artifact group id, `io.github.elki-project`.