Dimension reduction of embeddings before clustering
We are experimenting with using PCA and UMAP to dimension-reduce the entity embeddings before clustering.
Running full PCA on the whole set of entities could be computationally intensive. Two possible solutions (sketched after this list):
- incremental PCA including all duplicates
- limit to unique entity strings but use weighted PCA (https://github.com/jakevdp/wpca)
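A minimal sketch of the incremental route, assuming the mention embeddings are stacked in a NumPy array (the `embeddings` array and its shape are invented for illustration); the weighted-PCA route via jakevdp/wpca is only noted in a comment, since its API is not shown here:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Hypothetical input: one row per entity mention, duplicates included.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100_000, 300)).astype(np.float32)

# Incremental PCA over all rows, fitted in mini-batches so the full
# matrix never has to be decomposed at once.
ipca = IncrementalPCA(n_components=50, batch_size=10_000)
for start in range(0, embeddings.shape[0], 10_000):
    ipca.partial_fit(embeddings[start:start + 10_000])
reduced = ipca.transform(embeddings)

# Alternative (not shown): deduplicate the entity strings and run weighted
# PCA (e.g. jakevdp/wpca), weighting each unique embedding by its frequency.
```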
For UMAP, we would probably have to use the whole dataset, or fit it on a sample and then assign the remaining points (see the sketch below).
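A sketch of the sample-then-assign option, assuming the umap-learn package; the `embeddings` array and sample size are again made up for illustration:

```python
import numpy as np
import umap

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200_000, 300)).astype(np.float32)

# Fit UMAP on a random sample only...
sample_idx = rng.choice(embeddings.shape[0], size=20_000, replace=False)
reducer = umap.UMAP(n_components=10, metric="cosine")
reducer.fit(embeddings[sample_idx])

# ...then project the full dataset with the fitted model.
low_dim = reducer.transform(embeddings)
```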
The `fit()` method is separate from the `predict()` method in `NarrativeModel()`, so one can always fit a model on a sample and predict out-of-sample. If the sample is based on millions of sentences, then the dimension reduction algorithms (PCA, UMAP) and the cluster model will likely be computationally too intensive with the default settings.
However, the user can specify `pca_args` and `umap_args` to reduce the computational burden. For example, `sklearn.decomposition.PCA` has an argument `svd_solver`, and some solvers are stochastic and can handle larger datasets.
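For concreteness, a hedged sketch of the kind of settings meant here; the array is synthetic, and the `pca_args` / `umap_args` dictionaries show only plausible contents (how exactly `NarrativeModel` consumes them is not spelled out in this thread):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200_000, 300)).astype(np.float32)

# Randomized SVD scales much better than the exact solver on tall matrices.
pca = PCA(n_components=50, svd_solver="randomized", random_state=0)
reduced = pca.fit_transform(embeddings)

# The same kind of settings could be passed through the *_args dictionaries.
pca_args = {"n_components": 50, "svd_solver": "randomized"}
umap_args = {"n_components": 10, "n_neighbors": 15, "low_memory": True}
```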
Of course, KMeans and HDBSCAN could still be bottlenecks. One option would be to also provide Mini-batch KMeans as a clustering model of last resort.
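A minimal sketch of that fallback with scikit-learn's `MiniBatchKMeans` (the `reduced` array and cluster count are invented for illustration):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
reduced = rng.normal(size=(200_000, 50)).astype(np.float32)

# Mini-batch k-means trades a little accuracy for a large speed-up by
# updating centroids on small random batches instead of the full dataset.
mbk = MiniBatchKMeans(n_clusters=100, batch_size=4096, random_state=0)
labels = mbk.fit_predict(reduced)
```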
PS: For now, my impression is that users have relatively small datasets (in the tens to hundreds of thousands of sentences).
Done in relatio v0.3