About kmeans clustering

Question

Opened this issue 8 months ago · 2 comments

Hi,

First, thank you for this great work and for making it open.

I noticed that in your paper you mention:

You can cluster 400 million samples into 1 million clusters within 10 minutes.
In Table 5, three cluster counts are mentioned: 100K, 1M, and 10M.

Could you add more details about which tools you used for this clustering step?
I am very curious, as k-means can usually only handle a small number of clusters.

Thanks very much.

Answer 1 · 2024-04-28T06:21:05.000Z

We utilized a cluster of 20 machines, each equipped with 8 V100 GPUs, for parallel hierarchical clustering. Each V100 was responsible for clustering 20 million images into 1 million cluster centroids. Subsequently, we aggregated the centroids from all 20 machines, each contributing 1 million centroids, into a final set of 1 million centroids.

The library employed for this operation was faiss-gpu.
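
For readers who want to see the shape of such a two-stage scheme, here is a minimal sketch at toy scale. It is not the authors' code: it assumes the pipeline is "k-means on each shard, then k-means on the pooled centroids", and it uses a plain NumPy Lloyd's k-means as a stand-in for faiss-gpu so that it runs anywhere. The names `kmeans` and `two_stage_kmeans`, the shard count, and all sizes are hypothetical. With faiss installed, `faiss.Kmeans(d, k, gpu=True)` followed by `.train(x)` and `.centroids` could take the place of the NumPy `kmeans` helper.

```python
import numpy as np


def kmeans(x, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def two_stage_kmeans(x, n_shards, k_per_shard, k_final, seed=0):
    """Cluster each shard independently, then cluster the pooled centroids."""
    shards = np.array_split(x, n_shards)
    # Stage 1: run sequentially here; in a real setup each shard would go
    # to its own worker (assumption: one shard per GPU or machine).
    local = [kmeans(s, k_per_shard, seed=seed + i) for i, s in enumerate(shards)]
    pooled = np.vstack(local)  # shape: (n_shards * k_per_shard, d)
    # Stage 2: reduce the pooled centroids to the final set.
    return kmeans(pooled, k_final, seed=seed)


# Toy data: 2,000 random 8-dimensional vectors instead of 400M image features.
rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 8)).astype(np.float32)
final = two_stage_kmeans(x, n_shards=4, k_per_shard=50, k_final=50)
print(final.shape)  # (50, 8)
```

The second stage only sees `n_shards * k_per_shard` points, which is why the final aggregation stays cheap even when the first stage covers a very large dataset. With real data, shuffling before sharding is probably advisable so that each shard is representative of the whole set.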

Answer 2 · 2024-05-23T10:23:44.000Z

Thank you for sharing. May I ask if this portion of the code can be made open source?