relatio-nlp/relatio

Dimension reduction of embeddings before clustering

Closed this issue · 2 comments

We are experimenting with PCA and UMAP to reduce the dimensionality of the entity embeddings before clustering.

Running a full PCA on the whole set of entities could be computationally intensive. Two possible solutions: for UMAP, we probably have to either use the whole dataset, or fit it on a sample and then assign the reduced representation to the full dataset (a sketch of the sample-based approach is below).
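A minimal sketch of the sample-and-assign idea with umap-learn; the array, sample size, and UMAP parameters below are illustrative placeholders, not relatio's actual settings:

```python
import numpy as np
import umap  # umap-learn

# Placeholder entity embeddings: (n_entities, dim)
entity_embeddings = np.random.rand(100_000, 768).astype(np.float32)

# Fit UMAP on a random sample of entities only
rng = np.random.default_rng(0)
sample_idx = rng.choice(len(entity_embeddings), size=20_000, replace=False)
reducer = umap.UMAP(n_components=50, n_neighbors=15, metric="cosine")
reducer.fit(entity_embeddings[sample_idx])

# Assign (project) the full dataset with the fitted model
reduced = reducer.transform(entity_embeddings)
```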

The fit() method is separate from the predict() method in NarrativeModel(), so one can always fit a model on a sample and predict out-of-sample.
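The same pattern, sketched directly with sklearn's KMeans (presumably one of the components NarrativeModel wraps; relatio's own call signatures may differ, and the sizes below are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for dimension-reduced entity embeddings
reduced = np.random.rand(1_000_000, 50).astype(np.float32)

# Fit the clustering model on a sample only...
rng = np.random.default_rng(0)
sample = reduced[rng.choice(len(reduced), size=50_000, replace=False)]
kmeans = KMeans(n_clusters=100, random_state=0).fit(sample)

# ...then assign cluster labels to every entity out-of-sample
labels = kmeans.predict(reduced)
```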

If the sample contains millions of sentences, then the dimension-reduction algorithms (PCA, UMAP) and the clustering model will likely be computationally too intensive with the default settings.

However, the user can specify pca_args and umap_args to reduce the computational burden. For example, sklearn.decomposition.PCA has an svd_solver argument, and some of its solvers (e.g. the randomized one) are stochastic and can handle larger datasets.
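For instance, a minimal sketch with sklearn's randomized solver; the dict mirrors the kind of settings one could forward through pca_args, though the exact forwarding is up to relatio, and the placeholder array is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder entity embeddings
embeddings = np.random.rand(1_000_000, 300).astype(np.float32)

# Randomized SVD is stochastic and scales much better than the exact solver
pca_args = {"n_components": 50, "svd_solver": "randomized", "random_state": 0}
reduced = PCA(**pca_args).fit_transform(embeddings)
```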

Of course, KMeans and HDBSCAN could still be bottlenecks. One option would be to also provide Mini-batch KMeans as a clustering model of last resort.
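A sketch of what that fallback could look like with sklearn's MiniBatchKMeans (parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Placeholder for dimension-reduced embeddings
reduced = np.random.rand(1_000_000, 50).astype(np.float32)

# Mini-batch KMeans updates centroids on small chunks, keeping memory and
# runtime manageable even for millions of entities
mbk = MiniBatchKMeans(n_clusters=100, batch_size=4096, random_state=0)
labels = mbk.fit_predict(reduced)
```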

PS: For now, my impression is that users have relatively small datasets (in the tens to hundreds of thousands of sentences).

Done in relatio v0.3