rom1504/laion-prepro

How to download the newest version of dataset without duplicate files?

qiaogh97 opened this issue · 2 comments

Hi, @rom1504
I know there are three versions of the parquet files as below.

| Version | Parquet file size | Hash value | Total size |
|---|---|---|---|
| 1.0 | 1.6G | 5b54c5d5 | 400 million |
| 2.0 | 3.6G | 03f11a48 | 800 million |
| 3.0 | 4.9G | f27692e1 | 1.1 billion |

So I wonder whether the parquet files in the different versions correspond one-to-one.
I have already downloaded the 400 million version of the dataset. What should I do if I'd like to download the newest version without re-downloading the duplicate files?

Hi,
All three versions you mention are free of duplicates, and each one is a subset of the next, i.e. version 2 contains version 1, and version 3 contains version 2.
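
Since each version contains the previous one, the samples you are missing are the rows of the newer metadata that do not appear in the older one. Here is a minimal sketch of that filtering step. It assumes the metadata rows have already been read from the parquet files into dicts, and that a column such as `URL` identifies a sample; both the helper name `rows_not_in_old` and the `URL` key are assumptions for illustration, not something shipped in this repo.

```python
from typing import Dict, Iterable, Iterator


def rows_not_in_old(
    old_rows: Iterable[Dict[str, str]],
    new_rows: Iterable[Dict[str, str]],
    key: str = "URL",  # assumed identifying column; adjust to the real schema
) -> Iterator[Dict[str, str]]:
    """Yield the rows of the newer version whose key is absent from the older one."""
    # Collect every key already present in the version you have downloaded.
    seen = {row[key] for row in old_rows}
    # Keep only the rows of the newer version that are not in that set.
    for row in new_rows:
        if row[key] not in seen:
            yield row


# Tiny illustrative example with made-up rows.
v1 = [{"URL": "http://a/1.jpg"}, {"URL": "http://a/2.jpg"}]
v2 = v1 + [{"URL": "http://a/3.jpg"}]
extra = list(rows_not_in_old(v1, v2))
```

For hundreds of millions of rows, holding every key in one in-memory set may be too heavy, so in practice you would probably process the files in batches or store hashes of the keys instead.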

Only the 400M version (the first one) is properly released by us (that's the one we call laion400m) and you can get it from https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/ or https://www.kaggle.com/romainbeaumont/laion400m

The other two versions you mention are work in progress and are not yet ready for use. For example, unlike version 1, versions 2 and 3 are not fully randomly shuffled, which is an important property for using the dataset.

We will release a larger version of the dataset, with a few billion samples, in a few months.

Do you have any deadlines or planned uses for the larger dataset (larger than 400M) on your side?

It doesn't matter, I don't have any deadlines.