How to set cluster_selection_epsilon when using cosine distances?
ma9o commented
Hi, I am using HDBSCAN to cluster text embeddings.
As the data is unbalanced in favor of one category of embeddings, I am obtaining too many sub-clusters of that category, which I would like to merge. I have found that datapoints with a cosine distance < 0.7 should belong to the same cluster, and if I understand correctly, I should set cluster_selection_epsilon=0.7 to achieve this.
This doesn't seem to work: all the datapoints end up in the same cluster (is the value too high?).
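A minimal sketch for sanity-checking a threshold like 0.7, assuming plain NumPy and a small toy array in place of the real embeddings. It reports what share of all pairs falls below the threshold. The helper names `cosine_distance_matrix` and `share_of_pairs_below` are hypothetical, not part of hdbscan or cuML. Running it once on the original embeddings and once on the UMAP output may show whether the threshold means the same thing in both spaces.

```python
import numpy as np

def cosine_distance_matrix(x):
    """Pairwise cosine distances (1 - cosine similarity) between rows of x."""
    x = np.asarray(x, dtype=np.float64)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    # Clip to remove tiny negative values caused by floating-point error.
    return np.clip(1.0 - unit @ unit.T, 0.0, 2.0)

def share_of_pairs_below(dist, threshold):
    """Fraction of distinct pairs (i < j) whose distance is below threshold."""
    rows, cols = np.triu_indices(dist.shape[0], k=1)
    return float(np.mean(dist[rows, cols] < threshold))

# Toy stand-in for the real embeddings (illustrative values only).
toy_vectors = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
toy_dist = cosine_distance_matrix(toy_vectors)
share = share_of_pairs_below(toy_dist, 0.7)
print(f"share of pairs below 0.7: {share:.3f}")
```

If a large share of pairs is already below the threshold, the value is probably too coarse to separate anything in that space.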
My current code:
import numpy as np
import cupy as cp
import cuml
from cuml.metrics import pairwise_distances
from hdbscan import HDBSCAN

embeddings_gpu = cp.asarray(embeddings)
umap_model = cuml.UMAP(n_neighbors=15,
                       n_components=100,
                       metric='cosine')
reduced_data_gpu = umap_model.fit_transform(embeddings_gpu)
cosine_dist = pairwise_distances(reduced_data_gpu, metric='cosine')

clusterer = HDBSCAN(min_cluster_size=5,
                    gen_min_span_tree=True,
                    metric="precomputed",
                    cluster_selection_epsilon=0.7)
cluster_labels = clusterer.fit_predict(cosine_dist.astype(np.float64).get())
cluster_labels:
Shape: 9533
array([0, 0, 0, ..., 0, 0, 0])
cosine_dist:
Shape: (9533, 9533)
array([[5.9604645e-07, 1.6956329e-02, 5.4422319e-02, ..., 1.0555809e+00,
1.1026136e+00, 1.3615031e+00],
...,
[1.3615031e+00, 1.4514638e+00, 1.3940278e+00, ..., 3.1383842e-01,
7.0653200e-02, 5.9604645e-07]], dtype=float32)
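One diagnostic that may help with a matrix like this, sketched under stated assumptions. As far as the hdbscan documentation describes it, cluster_selection_epsilon is compared against distances in the cluster hierarchy, which is built on mutual reachability distances rather than raw pairwise distances. The sketch below approximates those distances and counts how many minimum-spanning-tree edges exceed a given epsilon. It is not hdbscan's implementation: the helpers `mutual_reachability` and `mst_edge_weights` are hypothetical names, the core-distance convention (k-th smallest entry per row, counting the point itself) is an assumption, and the toy data stands in for the real 9533 x 9533 matrix.

```python
import numpy as np

def mutual_reachability(dist, k=5):
    """Mutual reachability: max(core_k(a), core_k(b), d(a, b))."""
    dist = np.asarray(dist, dtype=np.float64)
    k = min(k, dist.shape[0] - 1)
    # Core distance: k-th smallest entry per row (index 0 is the point itself).
    core = np.partition(dist, k, axis=1)[:, k]
    return np.maximum(dist, np.maximum(core[:, None], core[None, :]))

def mst_edge_weights(dist):
    """Edge weights of a minimum spanning tree via Prim's algorithm, O(n^2)."""
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()  # cheapest known edge from the tree to each node
    best[0] = np.inf
    weights = np.empty(n - 1)
    for i in range(n - 1):
        j = int(np.argmin(best))
        weights[i] = best[j]
        in_tree[j] = True
        best = np.minimum(best, dist[j])
        best[in_tree] = np.inf  # never pick a node that is already in the tree
    return weights

# Toy data: two tight groups of unit vectors roughly 90 degrees apart.
angles = np.concatenate([np.linspace(0.0, 0.1, 10),
                         np.pi / 2 + np.linspace(0.0, 0.1, 10)])
points = np.column_stack([np.cos(angles), np.sin(angles)])
toy_dist = np.clip(1.0 - points @ points.T, 0.0, 2.0)

weights = mst_edge_weights(mutual_reachability(toy_dist, k=5))
epsilon = 0.7
n_above = int((weights > epsilon).sum())
print(f"MST edges above epsilon={epsilon}: {n_above} of {weights.size}")
```

If few or no edges exceed epsilon on the real matrix, nearly every split in the hierarchy happens below the threshold, which would be consistent with most points collapsing into a single cluster.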
Is this the correct use of cluster_selection_epsilon? Thanks