How to download the newest version of dataset without duplicate files?
qiaogh97 opened this issue · 2 comments
Hi, @rom1504
I know there are three versions of the parquet files, as listed below.
Version | Parquet file size | Hash value | Number of samples |
---|---|---|---|
1.0 | 1.6G | 5b54c5d5 | 400 million |
2.0 | 3.6G | 03f11a48 | 800 million |
3.0 | 4.9G | f27692e1 | 1.1 billion |
So I wonder whether the parquet files in the different versions correspond one-to-one.
I have already downloaded the 400 million version of the dataset. What should I do if I'd like to download the newest version without downloading the duplicate files again?
Hi,
All three versions you mention are free of duplicates, and each one is a subset of the next, i.e. version 2 contains version 1, and version 3 contains version 2.
Only the 400M version (the first one) is properly released by us (that's the one we call laion400m) and you can get it from https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/ or https://www.kaggle.com/romainbeaumont/laion400m
The other 2 versions you mention are work in progress and are not yet fully ready for use (for example, unlike version 1, versions 2 and 3 are not fully randomly shuffled, and random shuffling is an important property for use of the dataset)
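If you still want to experiment with a newer metadata file, then since each version contains the previous one, one possible approach is to keep only the rows whose key is not already in the version you have and download just those. This is a rough sketch rather than an official tool: the `URL` column name, the helper names, and the file names in the comments are assumptions for illustration, and `load_rows` needs pandas with a parquet engine such as pyarrow installed.

```python
from typing import Iterable, Iterator, Mapping


def rows_not_in_old(
    old_keys: Iterable[str],
    new_rows: Iterable[Mapping[str, object]],
    key: str = "URL",  # assumed column name; check your parquet schema
) -> Iterator[Mapping[str, object]]:
    """Yield the rows of the newer version whose key is absent from the older one."""
    seen = set(old_keys)
    for row in new_rows:
        if row[key] not in seen:
            yield row


def load_rows(parquet_path: str, columns=None):
    """Load a parquet metadata file as a list of dicts (hypothetical helper)."""
    import pandas as pd  # imported lazily; requires pandas plus pyarrow or fastparquet

    return pd.read_parquet(parquet_path, columns=columns).to_dict("records")


# Example usage with hypothetical file names:
# old = load_rows("version1.parquet", columns=["URL"])
# new = load_rows("version3.parquet")
# todo = list(rows_not_in_old((r["URL"] for r in old), new))
```

Holding hundreds of millions of URLs in a Python set needs a lot of memory, so in practice this would probably have to be done in chunks or with hashed keys.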
We will release a larger version of the dataset with a few billion samples in a few months.
Do you have any deadlines / uses of the larger dataset (larger than 400m) on your side?
It doesn't matter; I don't have any deadlines.