rom1504/laion-prepro

How many samples are in the dataset?

qiaogh97 opened this issue · 3 comments

Hi, @rom1504
I downloaded the 32 parquet files and counted the URLs. I find about 26760000 URLs in each parquet file, and 32 * 26760000 is about 856 million, i.e. over 800 million. But you said this dataset has 400M samples?
So where does the difference come from?

Hi, where did you download the parquet from?
http://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/ has laion400m

If you downloaded from 3080.rom1504.fr, you probably got a more recent version of the dataset, which is indeed much bigger (and not really released yet).

Ah yes, I see I left that 3080 link in the readme. I need to fix it :)

Ok, I see
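
For anyone who wants to check which version they downloaded, the count from the question can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes `pyarrow` is installed, that each row of a metadata file holds exactly one URL, and the function names `parquet_row_count` and `total_rows` are made up for illustration. It reads the row count from each file's footer instead of loading the data.

```python
def parquet_row_count(path):
    """Return the number of rows recorded in one parquet file's footer."""
    # Imported lazily so total_rows() can also be used with another counter.
    import pyarrow.parquet as pq

    return pq.ParquetFile(str(path)).metadata.num_rows


def total_rows(paths, count_rows=parquet_row_count):
    """Sum the row counts over all given files.

    Assumption: one row per URL, so the row total equals the URL total.
    """
    return sum(count_rows(p) for p in paths)
```

With the files in a local directory (the name `laion400m-meta` here is hypothetical), `total_rows(sorted(pathlib.Path("laion400m-meta").glob("*.parquet")))` gives the overall count. A result near 400 million would point to the laion400m release, while a result above 800 million would match the newer, larger version mentioned above.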