deepglint/unicom

About kmeans clustering

Hi,

First, thanks for such great work and for making it open.

I noticed that your paper mentions:

  • that 400 million samples can be clustered into 1 million clusters within 10 minutes
  • in Table 5, three cluster counts: 100K, 1M, and 10M

Could you add more details about which tools you used for this clustering step?
I am very curious, as k-means can usually only handle a small number of clusters.

Thanks very much.

We utilized a cluster of 20 machines, each equipped with 8 V100 GPUs, for parallel hierarchical clustering. Each V100 was responsible for clustering 20 million images into 1 million cluster centroids. Subsequently, we aggregated the centroids from all 20 machines, each contributing 1 million centroids, into a final set of 1 million centroids.

The library employed for this operation was faiss-gpu.
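
For readers who want to try the idea at small scale, here is a minimal single-process sketch of the two-stage scheme described above: cluster each shard independently, pool the per-shard centroids, then cluster the pooled centroids. It is not the authors' code. The names `fit_centroids` and `two_stage_kmeans`, the toy shard and cluster counts, and the unweighted second stage are all assumptions for illustration. It uses `faiss.Kmeans` when faiss is installed and falls back to a plain NumPy Lloyd's loop otherwise.

```python
import numpy as np

try:
    import faiss  # faiss-cpu or faiss-gpu, if available
except ImportError:
    faiss = None


def fit_centroids(x, k, niter=20, seed=0, use_gpu=False):
    """Return k centroids for the rows of x (hypothetical helper)."""
    x = np.ascontiguousarray(x, dtype=np.float32)
    if faiss is not None:
        # faiss.Kmeans forwards niter/seed to its clustering parameters;
        # gpu=True asks it to use the available GPUs.
        km = faiss.Kmeans(x.shape[1], k, niter=niter, seed=seed, gpu=use_gpu)
        km.train(x)
        return km.centroids.copy()
    # Fallback: plain Lloyd's algorithm in NumPy (only sensible for toy sizes).
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(niter):
        dist2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dist2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids


def two_stage_kmeans(x, n_shards, k_local, k_final):
    """Stage 1: k-means per shard. Stage 2: k-means over pooled centroids."""
    shards = np.array_split(x, n_shards)
    # In a real deployment each shard would run on its own GPU or machine.
    local = [fit_centroids(s, k_local, seed=i) for i, s in enumerate(shards)]
    pooled = np.vstack(local)  # shape: (n_shards * k_local, d)
    # Assumption: every local centroid counts equally in the second stage.
    return fit_centroids(pooled, k_final)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(800, 16)).astype(np.float32)
    final = two_stage_kmeans(features, n_shards=4, k_local=8, k_final=4)
    print(final.shape)
```

At the scale mentioned in the reply, stage 1 would run in parallel across the GPUs with `use_gpu=True`, and only the per-shard centroids would need to be gathered for stage 2.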

Thank you for sharing. May I ask if this portion of the code can be made open source?