magic-research/Dataset_Quantization

Using with large 10M or 100M datasets

stabilize-ai opened this issue · 3 comments

Thanks for the great work and code!

I was going over the code and realized that the bin creation relies on an N x N similarity matrix, where N is the number of examples.

That would lead to memory issues when scaling to large datasets with 10M or 100M examples, because it would require a matrix of size 10M x 10M or 100M x 100M.
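For a rough sense of scale, here is a back-of-envelope estimate (my own numbers, not from the repo) of what a dense float32 similarity matrix would occupy:

```python
# Back-of-envelope memory estimate for a dense float32 N x N similarity matrix.
def dense_similarity_bytes(n_examples: int, dtype_bytes: int = 4) -> int:
    """Return the memory (in bytes) needed to store an N x N matrix densely."""
    return n_examples * n_examples * dtype_bytes

for n in (50_000, 1_000_000, 10_000_000, 100_000_000):
    tib = dense_similarity_bytes(n) / 1024**4  # bytes -> TiB
    print(f"N = {n:>11,}: {tib:,.2f} TiB")
```

Already at 10M examples the dense matrix lands in the hundreds-of-TiB range, so it cannot be materialized on a single machine.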

Have you thought about ways to address those use cases?

Thanks for the question!

The memory issue is indeed a problem when dealing with large datasets, and it is actually a general problem for dataset processing methods such as coreset selection and clustering. It can be addressed from both the optimization and the system perspective.
There is quite a lot of work on efficient clustering for large-scale datasets that you can look to for insights. For example, the demanding 10M x 10M matrix can be approximated through smaller matrices computed on multiple nodes, as sketched below.
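As one illustration of that idea (my own sketch, not code from this repo): instead of materializing the full N x N matrix, you can compute similarities block by block and keep only the top-k neighbors per row, so peak memory stays at one block slice plus an O(N x k) result. The `features` array, block size, and `k` below are placeholders.

```python
import numpy as np

def topk_similarity_blocks(features: np.ndarray, k: int = 10, block: int = 4096):
    """Build a sparse top-k cosine-similarity graph without the full N x N matrix.

    features: (N, D) array of L2-normalized embeddings (placeholder input).
    Returns (indices, values), each of shape (N, k).
    """
    n = features.shape[0]
    topk_idx = np.empty((n, k), dtype=np.int64)
    topk_val = np.empty((n, k), dtype=np.float32)
    for start in range(0, n, block):
        end = min(start + block, n)
        # Only a (block x N) slice of the similarity matrix exists in memory at a time.
        sims = features[start:end] @ features.T
        # Keep the k largest similarities per row, discard the rest.
        idx = np.argpartition(sims, -k, axis=1)[:, -k:]
        topk_idx[start:end] = idx
        topk_val[start:end] = np.take_along_axis(sims, idx, axis=1)
    return topk_idx, topk_val
```

At 10M-100M examples even the feature matrix itself would have to be sharded across nodes, and the exact block products could be replaced by an approximate nearest-neighbor library such as Faiss; the sketch only illustrates the blocking idea.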

In this work, we mainly deal with CIFAR and ImageNet, where the number of examples is at most on the order of 1M. Extending the method to larger scales such as ImageNet-21K would also be very meaningful.

Thank you @vimar-gu; I'll look through works on these two angles: (1) optimization and (2) system perspectives. Do you also have some papers / repos in mind that you like for these topics?

As I'm not very familiar with this area, I can only give limited advice. You can refer to papers like:

  • A Global Optimization Algorithm for K-Center Clustering of One Billion Samples. Jiayang Ren et al.
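
That paper targets globally optimal k-center; as a simpler illustration of why k-center-style selection can avoid the N x N matrix entirely (my own sketch, not the algorithm from that paper or from this repo), the classic greedy 2-approximation only keeps one length-N distance vector at a time:

```python
import numpy as np

def greedy_k_center(features: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Greedy 2-approximation for k-center selection.

    Keeps only an (N,) vector of distances to the nearest chosen center,
    so memory is O(N) instead of O(N^2). `features` is a placeholder (N, D) array.
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    centers = [int(rng.integers(n))]
    # Distance from every point to its nearest selected center so far.
    min_dist = np.linalg.norm(features - features[centers[0]], axis=1)
    for _ in range(k - 1):
        # Pick the point farthest from all current centers as the next center.
        nxt = int(np.argmax(min_dist))
        centers.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(features - features[nxt], axis=1))
    return np.asarray(centers)
```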

Also, there is a Python package you can use for dealing with huge-scale data: Vaex.
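
A minimal sketch of the out-of-core style Vaex is built for (the file path and column name below are assumptions, not from this repo):

```python
import vaex

# Open a large on-disk dataset memory-mapped / out-of-core.
df = vaex.open("embeddings.hdf5")  # placeholder path

# Aggregations stream over the data in chunks instead of loading it all into RAM.
print(len(df))
print(df.mean(df.feature_0))  # 'feature_0' is an assumed column name
```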