Cannot download?
sri-hk opened this issue · 5 comments
Hi, nice work! But the data cannot be downloaded.
This dataset, laion2b-en-vit-h-14-embeddings, has been disabled. Is there any other solution to get your deduplicated laion-2b-en data?
Looking forward to your reply, thanks!
+1
As sri-hk said, LAION is no longer distributing laion2b and its variants. Since I only provided a filtering of that data, it won't be available anymore. I'll remove that code soon and maybe add functionality to deduplicate your own dataset or another dataset.
@ryanwebster90 Can you provide deduped results based on URLs?
That way, users would not need to download the now-deleted dataset on Hugging Face. With URLs, labs that have downloaded laion2b (but in a different format/order from the Hugging Face dataset) would be able to leverage your deduped results.
@ppwwyyxx I cannot, and don't plan to, as the dataset is facing ethical issues. For now, I'd suggest checking out DataComp-1B; I may deduplicate that dataset at some point.
Thank you for your reply.
I found another way to deduplicate, and so far so good. pHash (perceptual hashing) is efficient and fast, but you may need many CPU cores and a lot of memory.
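For anyone landing here, a rough sketch of what pHash-based deduplication can look like. This is not the exact code I ran, just an illustration in plain Python under a few assumptions: each image is already decoded and resized to a 32x32 grid of grayscale values (decoding and resizing are left out), the hash thresholds the low-frequency DCT coefficients against their median, and the names `phash`, `hamming`, `deduplicate`, and `max_distance` are made up for this example. In practice a library such as `imagehash` would likely be a better starting point.

```python
import math
from statistics import median

SIZE = 32  # side of the grayscale grid the hash is computed on (assumed)
LOW = 8    # side of the low-frequency block that is kept -> 64-bit hash

# DCT-II basis for the first LOW frequencies:
# COS[k][n] = cos(pi * (2n + 1) * k / (2 * SIZE))
COS = [[math.cos(math.pi * (2 * n + 1) * k / (2 * SIZE)) for n in range(SIZE)]
       for k in range(LOW)]


def phash(pixels):
    """Return a 64-bit perceptual hash of a SIZE x SIZE grayscale grid."""
    # 1D DCT along each row, keeping only the first LOW frequencies.
    rows = [[sum(row[n] * COS[k][n] for n in range(SIZE)) for k in range(LOW)]
            for row in pixels]
    # 1D DCT along the columns of that intermediate result.
    coeffs = [sum(rows[n][kx] * COS[ky][n] for n in range(SIZE))
              for ky in range(LOW) for kx in range(LOW)]
    # One bit per coefficient: 1 if it is above the median, else 0.
    med = median(coeffs)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (c > med)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def deduplicate(images, max_distance=4):
    """Keep the first image of every group of near-duplicates.

    images: dict mapping a name to its pixel grid.
    max_distance: hypothetical threshold; hashes at most this many bits
    apart are treated as duplicates.
    """
    kept = []  # (name, hash) pairs of the images kept so far
    for name, pixels in images.items():
        h = phash(pixels)
        if all(hamming(h, kept_hash) > max_distance for _, kept_hash in kept):
            kept.append((name, h))
    return [name for name, _ in kept]
```

Note that `deduplicate` compares every new hash against all kept hashes, so it is quadratic in the number of images. That is fine for a small test, but at LAION scale you would need some kind of index or bucketing over the hashes, which is where the CPU and memory cost comes from.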