Dimension reduction of embeddings before clustering
We are experimenting with using PCA and UMAP to dimension-reduce the entity embeddings before clustering.
Running full PCA on the whole set of entities could be computationally intensive. Two possible solutions (sketched after this list):
- incremental PCA including all duplicates
- limit to unique entity strings but use weighted PCA (https://github.com/jakevdp/wpca)
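A minimal sketch of the incremental route, assuming the mention embeddings are stacked in a NumPy array (the `embeddings` array and its shape are invented for illustration); the weighted-PCA route via jakevdp/wpca is only noted in a comment, since its API is not shown here:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Hypothetical input: one row per entity mention, duplicates included.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100_000, 300)).astype(np.float32)

# Incremental PCA over all rows, fitted in mini-batches so the full
# matrix never has to be decomposed at once.
ipca = IncrementalPCA(n_components=50, batch_size=10_000)
for start in range(0, embeddings.shape[0], 10_000):
    ipca.partial_fit(embeddings[start:start + 10_000])
reduced = ipca.transform(embeddings)

# Alternative (not shown): deduplicate the entity strings and run weighted
# PCA (e.g. jakevdp/wpca), weighting each unique embedding by its frequency.
```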
For UMAP, we would probably have to use the whole dataset, or fit it on a sample and then assign the remaining points (see the sketch below).
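A sketch of the sample-then-assign option, assuming the umap-learn package; the `embeddings` array and sample size are again made up for illustration:

```python
import numpy as np
import umap

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200_000, 300)).astype(np.float32)

# Fit UMAP on a random sample only...
sample_idx = rng.choice(embeddings.shape[0], size=20_000, replace=False)
reducer = umap.UMAP(n_components=10, metric="cosine")
reducer.fit(embeddings[sample_idx])

# ...then project the full dataset with the fitted model.
low_dim = reducer.transform(embeddings)
```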
The `fit()` method is separate from the `predict()` method in `NarrativeModel()`, so one can always fit a model on a sample and predict out-of-sample. If the sample is based on millions of sentences, then the dimension reduction algorithms (PCA, UMAP) and the cluster model will likely be computationally too intensive with the default settings.
However, the user can specify `pca_args` and `umap_args` to reduce the computational burden. For example, `sklearn.decomposition.PCA` has an argument `svd_solver`, and some solvers are stochastic and can handle larger datasets.
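For concreteness, a hedged sketch of the kind of settings meant here; the array is synthetic, and the `pca_args` / `umap_args` dictionaries show only plausible contents (how exactly `NarrativeModel` consumes them is not spelled out in this thread):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200_000, 300)).astype(np.float32)

# Randomized SVD scales much better than the exact solver on tall matrices.
pca = PCA(n_components=50, svd_solver="randomized", random_state=0)
reduced = pca.fit_transform(embeddings)

# The same kind of settings could be passed through the *_args dictionaries.
pca_args = {"n_components": 50, "svd_solver": "randomized"}
umap_args = {"n_components": 10, "n_neighbors": 15, "low_memory": True}
```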
Of course, KMeans and HDBSCAN could still be bottlenecks. One option would be to also provide Mini-batch KMeans as a clustering model of last resort.
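A minimal sketch of that fallback with scikit-learn's `MiniBatchKMeans` (the `reduced` array and cluster count are invented for illustration):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
reduced = rng.normal(size=(200_000, 50)).astype(np.float32)

# Mini-batch k-means trades a little accuracy for a large speed-up by
# updating centroids on small random batches instead of the full dataset.
mbk = MiniBatchKMeans(n_clusters=100, batch_size=4096, random_state=0)
labels = mbk.fit_predict(reduced)
```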
PS: For now, my impression is that users have relatively small datasets (in the tens to hundreds of thousands of sentences).
Done in relatio v0.3