scikit-learn-contrib/hdbscan

Can I force approximate_predict to assign every embedding to an existing cluster?

mirix opened this issue · 3 comments

mirix commented

Hello,

Let me see if I am understanding things correctly.

I am reducing dimensionality with UMAP:

		import umap

		clusterable_embedding_large = umap.UMAP(
		    n_neighbors=n_neighbors,
		    min_dist=0.0,
		    n_components=comp,
		    random_state=31416,
		    metric='cosine'
		).fit_transform(df_dist)

Then I split the UMAP embeddings according to predefined indices (separating long from short sentences):

		cel_long = clusterable_embedding_large[long_seg]
		cel_shor = clusterable_embedding_large[shor_seg]

Then I cluster the long sentences only:

		import hdbscan

		clusterer = hdbscan.HDBSCAN(
		    min_samples=1,
		    min_cluster_size=cluster_size,
		    #cluster_selection_method='eom',
		    cluster_selection_method='leaf',
		    cluster_selection_epsilon=5.0,
		    gen_min_span_tree=True,
		    prediction_data=True
		).fit(cel_long)

Next I would like to assign each of the short sentences to one of the pre-existing clusters:

		labels = list(clusterer.labels_)
		labels_short, strengths = hdbscan.approximate_predict(clusterer, cel_shor)
		labels_short = list(labels_short)

		print(labels)
		[0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]
		print(labels_short)
		[1, -1, 0, 1, -1, -1, 0, -1, 0, -1, -1, 1, -1, 0, 1, 0, -1, 0, -1, -1, 0, -1, 2, 0, 0, 0, -1, 0, 0, -1, 0, 0, 0, 0, -1, -1, 0, 0, -1, -1, -1, -1]

However, I face two issues:

  1. Some points are not assigned (label -1).

  2. Some points are assigned to a new cluster which did not exist in the original clustering (label 2).

I believe I understand the first issue, but I would like to avoid it if possible. Is there a way to force approximate_predict to assign every data point to the nearest existing cluster, no matter what?

As for the second issue, I was under the impression that it could not happen. From the docs:

With that done you can run [approximate_predict()](https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict) with the model and any new data points you wish to predict. Note that this differs from re-running HDBSCAN with the new points added since no new clusters will be considered – instead the new points will be labelled according to the clusters already labelled by the model.

Can this also be avoided?

Best,

Ed

mirix commented

Thanks, it seems promising. I will look into that.

In the meantime, I have found a workaround:

I cluster all the points together as usual. Then, for each short sentence, I compute the average distance to each cluster (excluding the short sentences themselves) and reassign the label if required.

This seems to solve the problem on the current dataset.
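Roughly, the workaround looks like this (a sketch; the function and variable names are illustrative, not from my actual code):

```python
import numpy as np
from scipy.spatial.distance import cdist

def reassign_by_mean_distance(short_pts, long_pts, long_labels):
    """Assign each short point to the cluster whose (long) members are
    closest on average; noise points in the long set are ignored."""
    cluster_ids = np.unique(long_labels[long_labels >= 0])
    # mean distance from every short point to each cluster's members
    dists = np.stack([
        cdist(short_pts, long_pts[long_labels == c]).mean(axis=1)
        for c in cluster_ids
    ], axis=1)
    return cluster_ids[np.argmin(dists, axis=1)]

# Tiny example: two well-separated clusters of "long" points
long_pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
long_labels = np.array([0, 0, 1, 1])
short_pts = np.array([[0.0, 0.5], [10.0, 10.5]])
assigned = reassign_by_mean_distance(short_pts, long_pts, long_labels)
```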

mirix commented

In case you are interested, HDBSCAN works wonderfully for clustering speakers in a diarisation project:

https://github.com/mirix/approaches-to-diarisation

I am really impressed. The challenge now would be to come up with some heuristics or ML to guess the optimal parameters automatically.