scikit-learn-contrib/hdbscan

Validation questions

mdagost opened this issue · 1 comment

I'm using both the relative_validity_ attribute and the full validity_index function from hdbscan.validity. @lmcinnes, if they give different optimal parameters, is there a reason to prefer one over the other? Perhaps validity_index, because relative_validity_ is approximate?

My application is NLP clustering of embedding vectors, and one of the things I'm testing is different embeddings with different dimensionalities. Is it valid to use either of those metrics to compare across embeddings for the same dataset, or only across the hdbscan parameters themselves?

Thank you so much!

Just thought I'd bump this if you have any thoughts, out of the kindness of your heart @lmcinnes :)