facebookresearch/swav

[Discussion] Can I use DeepClusterV2 for the sole purpose of learning to cluster an unlabelled dataset?

timothylimyl opened this issue · 1 comments

Hi,

I currently have a dataset with 100k+ images, and I want to cluster the images into k folders so that they are easier to annotate later on. I am thinking of using DeepCluster to learn to group the dataset into subsets, then passing the subsets to annotators to re-group and delete images. I understand that DeepCluster is meant for unsupervised training of a generalised feature extractor on a huge/unlimited amount of data (since no annotation is needed). However, can it be used instead to learn the features needed to cluster a custom dataset with only 100k-500k images?

Note that the custom dataset is very domain-specific, e.g., 100k+ images of different traffic signs (40+ classes). My intuition is that DeepCluster will be able to learn the specific features needed to cluster the dataset into different groups. My concern with simple clustering methods (e.g., PCA + k-means) is that the solution may end up being trivial, e.g., splitting the traffic signs into 2 subgroups (night versus day) and leaving all other clusters empty. Simple clustering methods also cannot deal well with such a huge dataset, since the images are not featurised into a latent space first. This brought me to DeepCluster, as I read that it can deal with empty clusters and trivial parametrisation while also using backprop to learn cluster assignments (pseudo-labels) with a stronger featuriser (a vision backbone, e.g., a CNN or ViT). Please correct me if I am sorely mistaken.
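To make the "trivial solution" concern checkable, here is a minimal sketch of a diagnostic that could be run on the output of any clustering method. It only assumes a list of integer cluster labels in the range 0 to k-1; the function name `cluster_size_report` and the `dominance_threshold` default are hypothetical choices for illustration, not part of this repo.

```python
from collections import Counter


def cluster_size_report(labels, k, dominance_threshold=0.5):
    """Summarise cluster sizes and flag empty or dominant clusters.

    labels: iterable of int cluster ids in [0, k).
    k: number of clusters requested from the clustering method.
    dominance_threshold: hypothetical cut-off; a cluster holding more than
        this fraction of all samples is reported as dominant.
    """
    labels = list(labels)
    n = len(labels)
    counts = Counter(labels)

    # Size of every requested cluster, including clusters that got no samples.
    sizes = [counts.get(c, 0) for c in range(k)]
    empty = [c for c, s in enumerate(sizes) if s == 0]
    dominant = [c for c, s in enumerate(sizes) if n and s / n > dominance_threshold]
    return {"sizes": sizes, "empty": empty, "dominant": dominant}


if __name__ == "__main__":
    # Toy example: 3 clusters requested, one is empty and one holds most samples.
    print(cluster_size_report([0, 0, 0, 1], k=3))
```

If most clusters come back empty or one or two clusters hold nearly all images (the night versus day case), that would suggest the features are not separating the sign classes.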

One major problem that comes to mind is that I do not know when to stop training DeepCluster.
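One heuristic I could imagine, offered only as a sketch and not as something this repo provides, is to stop once the cluster assignments stop changing much between consecutive epochs. Because cluster ids can be permuted from one epoch to the next, the comparison below uses normalised mutual information (NMI), which ignores label permutations. The function names and the `threshold` value are hypothetical.

```python
import math
from collections import Counter


def normalized_mutual_info(a, b):
    """NMI between two labelings of the same samples (geometric-mean normalisation)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("labelings must be non-empty and of equal length")
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))

    # Mutual information from the joint and marginal label frequencies.
    mi = sum(
        (nij / n) * math.log(nij * n / (count_a[i] * count_b[j]))
        for (i, j), nij in joint.items()
    )
    # Entropy of each labeling.
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())

    denom = math.sqrt(h_a * h_b)
    if denom == 0.0:
        # At least one labeling puts everything in a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mi / denom


def assignments_stable(prev_labels, curr_labels, threshold=0.95):
    """Hypothetical stopping test: True when consecutive epochs agree closely."""
    return normalized_mutual_info(prev_labels, curr_labels) >= threshold


if __name__ == "__main__":
    prev = [0, 0, 1, 1, 2, 2]
    curr = [2, 2, 0, 0, 1, 1]  # same grouping, permuted cluster ids
    print(assignments_stable(prev, curr))
```

A stable assignment does not guarantee that the clusters are meaningful, so this would probably need to be combined with a manual look at a few samples per cluster.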


My other option for this challenge would be to train an autoencoder (AE) on the dataset and cluster the encoder output (reducing the dimensionality via PCA if necessary). However, this may be a topic outside the scope of this repo. Happy to chat about it too.


The benefit, if DeepCluster can be used, is that annotators can easily move batches of images into the correct folders instead of going one by one. Secondly, we can improve the clustering algorithm by fine-tuning on a subset of labelled data, and maybe we can eventually reduce the k groups to match the actual classes that we have.
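For the folder workflow, a minimal sketch of what I have in mind is below. It assumes a list of image paths and a matching list of integer cluster labels from whichever clustering method ends up being used; the function name `export_clusters` and the `cluster_000` folder naming are hypothetical.

```python
import shutil
from pathlib import Path


def export_clusters(image_paths, labels, out_dir, copy=True):
    """Place each image into out_dir/cluster_XXX according to its cluster label.

    copy=True keeps the originals in place; copy=False moves them instead.
    Note: files that share a name within one cluster would overwrite each other.
    """
    image_paths = list(image_paths)
    labels = list(labels)
    if len(image_paths) != len(labels):
        raise ValueError("image_paths and labels must have the same length")

    out_dir = Path(out_dir)
    for path, label in zip(image_paths, labels):
        path = Path(path)
        target_dir = out_dir / f"cluster_{int(label):03d}"
        target_dir.mkdir(parents=True, exist_ok=True)
        destination = target_dir / path.name
        if copy:
            shutil.copy2(path, destination)
        else:
            shutil.move(str(path), str(destination))
```

Annotators could then drag whole folders, or large selections within a folder, into the correct class directory and only handle the outliers individually.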
