feature request: SNN clustering
maia opened this issue ยท 7 comments
I'd like to add a feature request for SNN Clustering (Shared Nearest Neighbors) and possibly also HDBSCAN.
The advantage of SNN over DBSCAN is to be able to identify clusters with different densities, and also it does not need to be provided a fixed number of clusters as k-means does. Also its way of identifying similarity between items has advantages over euclidian distance when working with a higher number of dimensions.
HDBSCAN is an interesting improvement over DBSCAN, as it only requires one hyperparameter, and reports a probability of assignment to a cluster (which can be used to optimize the minPts hyperparameter). Towards Data Science: How to cluster in High Dimensions has an interesting overview including possible advantages of HDBSCAN and SNN (in the variant SNN-cliq).
Both algorithms have no Ruby implementation yet, as far to my knowledge.
PS: thanks for all your work on Rumale so far, it's greatly appreciated.
Thank you for your interesting suggestion. I will try to implement SNN Clustering ๐ช Because the algorithm of HDBSCAN is complicated, such as extracting MST, it takes time to implement the HDBSCAN. I would like to make it a future issue.
Thats great to hear! It seems that SNN should be able to use large parts of the DBSCAN algorithm, just with a different distance metric than euclidian (and in general it might be a good idea to allow switching the distance metric in DBSCAN, some people might prefer manhattan distance, or ranked distance).
As for HDBSCAN, I just found a lua implemetation jariasf/hdbscan that has a very brief MST calculation, but it might be too naive and not performing well. Unfortunately I'm not yet familiar enough with this problem.
I have implemented and released SNN clustering in version 0.13.1. In addition, DBSCAN becomes able to use arbitrary distance metric.
require 'rumale'
# SNN
estimator = Rumale::Clustering::SNN.new(n_neighbors: 10, eps: 6, metric: 'euclidean')
cluster_labels = estimator.fit_predict(x)
# DBSCAN with manhattan distance.
distance_mat = Rumale::PairwiseMetric.manhattan_distance(x)
estimator = Rumale::Clustering::DBSCAN.new(eps: 0.5, metric: 'precomputed')
cluster_labels = estimator.fit_predict(distance_mat)
The HDBSCAN implementation will probably require Ruby Extension. I need time to design the program structure. I would like to consider the implementation of HDBSCAN when implementing hierarchical clustering such as the Ward method in the future.
That's great news, thanks a lot! I can confirm SNN works as intended. I just had to figure out what exactly needs to be passed to #fit_predict (or which data types are accepted), and as I think that's not too intuitive (also: #fit, #fit_transform), that can be something new users will be challenged with.
Rumale's method name and interface refer to scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN.fit_predict). Currently, I do not plan to change the Rumale's interface.
The documentation issue is that sometimes rubydoc.info fails to parse yard documents. I uploaded the yard document generated in the local environment earlier, so please refer to it.
https://yoshoku.github.io/rumale/doc/Rumale/Clustering/SNN.html
Moreover, some blog posts about Rumale are useful.
https://dev.to/kojix2/rumale-cheat-sheet-4pg7
I think one of the best solutions is to write a detail user guide, however I do not have enough time to do that.
@yoshoku Thanks a lot. I'm aware of how much work a user guide is, so let's hope someone will be able to contribute a blog post in the near future. Rumale is really helpful in instances where one does not want a python setup on an additional server, but to perform quick tasks from within a rails app itself.
@maia I have implemented and released HDBSCAN clustering in version 0.13.4. Thank you for your request. I have learned a lot from implementing that. I will close this issue.
Example code:
require 'rumale'
samples, _ = Rumale::Dataset.make_moons(500, noise: 0.08, random_seed: 1)
outliers = (Rumale::Utils.rand_uniform([50, 2], Random.new(1)) - 0.3) * 3.5 - Numo::DFloat[0.2, 0.5]
x = Numo::NArray.vstack([samples, outliers])
hdbscan = Rumale::Clustering::HDBSCAN.new(min_samples: 5)
cluster_labels = hdbscan.fit_predict(x)