scikit-learn-contrib/hdbscan

Can I force approximate_predict to assign every embedding to an existing cluster?

mirix opened this issue · 3 comments

mirix commented

Hello,

Let me see if I am understanding things correctly.

I am reducing dimensionality with UMAP:

		import umap

		clusterable_embedding_large = umap.UMAP(
		    n_neighbors=n_neighbors,
		    min_dist=0.0,
		    n_components=comp,
		    random_state=31416,
		    metric='cosine'
		).fit_transform(df_dist)

Then I split the UMAP embeddings according to predefined indices (separating long from short sentences):

		cel_long = clusterable_embedding_large[long_seg]
		cel_shor = clusterable_embedding_large[shor_seg]

Then I cluster the long sentences only:

		import hdbscan

		clusterer = hdbscan.HDBSCAN(
		    min_samples=1,
		    min_cluster_size=cluster_size,
		    #cluster_selection_method='eom',
		    cluster_selection_method='leaf',
		    cluster_selection_epsilon=5.0,
		    gen_min_span_tree=True,
		    prediction_data=True
		).fit(cel_long)

Next I would like to assign each of the short sentences to one of the pre-existing clusters:

		labels = list(clusterer.labels_)
		labels_short, strengths = hdbscan.approximate_predict(clusterer, cel_shor)
		labels_short = list(labels_short)

		print(labels)
		[0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]
		print(labels_short)
		[1, -1, 0, 1, -1, -1, 0, -1, 0, -1, -1, 1, -1, 0, 1, 0, -1, 0, -1, -1, 0, -1, 2, 0, 0, 0, -1, 0, 0, -1, 0, 0, 0, 0, -1, -1, 0, 0, -1, -1, -1, -1]

However, I face two issues:

  1. Some points are not assigned (label -1).

  2. Some points are assigned to a new cluster which did not exist in the original clustering (label 2).

I believe I understand the first issue, but I would like to avoid it if possible. Is there a way to force approximate_predict to assign every data point to the nearest existing cluster, no matter what?

As for the second issue, I was under the impression that it could not happen. From the docs:

With that done you can run [approximate_predict()](https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict) with the model and any new data points you wish to predict. Note that this differs from re-running HDBSCAN with the new points added since no new clusters will be considered – instead the new points will be labelled according to the clusters already labelled by the model.

Can this also be avoided?

Best,

Ed

mirix commented

Thanks, it seems promising. I will look into that.

In the meantime, I have found a workaround:

I cluster all the points together as usual. Then, for each short sentence, I compute the average distance to each cluster (excluding the short sentences themselves) and reassign the label if required.

This seems to solve the problem on the current dataset.
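Roughly, the workaround looks like this (a sketch; the function and variable names are illustrative, not from my actual code):

```python
import numpy as np
from scipy.spatial.distance import cdist

def reassign_by_mean_distance(short_pts, long_pts, long_labels):
    """Assign each short point to the cluster whose (long) members are
    closest on average; noise points in the long set are ignored."""
    cluster_ids = np.unique(long_labels[long_labels >= 0])
    # mean distance from every short point to each cluster's members
    dists = np.stack([
        cdist(short_pts, long_pts[long_labels == c]).mean(axis=1)
        for c in cluster_ids
    ], axis=1)
    return cluster_ids[np.argmin(dists, axis=1)]

# Tiny example: two well-separated clusters of "long" points
long_pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
long_labels = np.array([0, 0, 1, 1])
short_pts = np.array([[0.0, 0.5], [10.0, 10.5]])
assigned = reassign_by_mean_distance(short_pts, long_pts, long_labels)
```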

mirix commented

In case you are interested, HDBSCAN works wonderfully for clustering speakers in a diarisation project:

https://github.com/mirix/approaches-to-diarisation

I am really impressed. The challenge now would be to come up with some heuristics or ML to guess the optimal parameters automatically.