About kmeans clustering
Hi,
First, thanks for such great work and for making it open.
I noticed that your paper mentions the following:
- you can cluster 400 million samples into 1 million clusters within 10 minutes
- Table 5 lists three cluster counts: 100K, 1M, and 10M
Could you add more details about which tools you used for this clustering step?
I am very curious, as k-means can usually only handle a small number of clusters.
Thanks very much.
We utilized a cluster of 20 machines, each equipped with 8 V100 GPUs, for parallel hierarchical clustering. Each V100 was responsible for clustering 20 million images into 1 million cluster centroids. Subsequently, we aggregated the centroids from all 20 machines, each contributing 1 million centroids, into a final set of 1 million centroids.
The library employed for this operation was faiss-gpu.
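For readers following along, the two-stage idea described above (cluster each shard independently, pool the resulting centroids, then cluster the pool again) can be sketched roughly as follows. This is only a toy illustration in pure Python, not the code used for the paper: the helper names `kmeans` and `two_stage_kmeans`, the shard split, and all sizes are hypothetical, and a plain Lloyd's loop stands in for faiss-gpu.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    dim = len(points[0])
    for _ in range(n_iter):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p in points:
            # Assign the point to its nearest centroid.
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += p[d]
        for j in range(k):
            if counts[j]:  # an empty cluster keeps its previous centroid
                centroids[j] = [s / counts[j] for s in sums[j]]
    return centroids


def two_stage_kmeans(points, n_shards, k_per_shard, k_final, seed=0):
    """Stage 1: cluster each shard on its own. Stage 2: cluster the pooled centroids."""
    # Hypothetical sharding: a strided split stands in for one shard per worker.
    shards = [points[i::n_shards] for i in range(n_shards)]
    pooled = []
    for i, shard in enumerate(shards):
        pooled.extend(kmeans(shard, k_per_shard, seed=seed + i))
    return kmeans(pooled, k_final, seed=seed)


# Toy data: 2000 random 4-dimensional points (assumed sizes, for illustration only).
rng = random.Random(0)
data = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2000)]
centroids = two_stage_kmeans(data, n_shards=4, k_per_shard=16, k_final=8)
print(len(centroids), len(centroids[0]))  # prints "8 4"
```

At real scale the inner `kmeans` call would presumably be replaced by a GPU implementation such as the faiss-gpu one mentioned above, with each shard processed on a separate worker in parallel.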
Thank you for sharing. May I ask if this portion of the code can be made open source?