ryanwebster90/snip-dedup

Cannot download?

sri-hk opened this issue · 5 comments

Hi, nice work! But the data cannot be downloaded.

This dataset, laion2b-en-vit-h-14-embeddings, has been disabled. Is there any other way to get your deduplicated laion-2b-en data?

Looking forward to your reply, thanks!

As sri-hk said, LAION is no longer distributing laion2b and its variants. Since I only provided a filtering of that data, it won't be available anymore. I'll remove that code soon and maybe add functionality to deduplicate your own dataset or another dataset.

@ryanwebster90 Can you provide deduped results based on URLs?

That way users would not need to download the now-deleted dataset on Hugging Face. With URLs, labs that have already downloaded laion2b (but in a different format or order from the Hugging Face dataset) would be able to use your deduped results.

@ppwwyyxx I cannot, and don't plan to, as the dataset is facing ethical issues. For now, I'd suggest checking out DataComp-1B; I may deduplicate that dataset instead.

Thank you for your reply.
I found another way to deduplicate, and so far so good. pHash is efficient and fast, but you may need many CPUs and a lot of memory.
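
For readers wondering what pHash-based deduplication could look like, here is a minimal pure-Python sketch. It is not the pipeline used in this thread. The names `phash`, `hamming`, and `dedup`, the 32x32 input size, and the distance threshold of 4 are all assumptions made for illustration. It assumes each image has already been decoded, converted to grayscale, and resized to a small square grid by some image library.

```python
import math
from typing import Dict, List, Sequence

Gray = Sequence[Sequence[float]]  # square grayscale image, row-major


def dct_low(pixels: Gray, keep: int = 8) -> List[List[float]]:
    """Top-left keep x keep block of the (unnormalised) 2D DCT-II."""
    n = len(pixels)
    # Cosine basis for the lowest `keep` frequencies only.
    basis = [[math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
             for u in range(keep)]
    # Transform along rows, then along columns.
    tmp = [[sum(row[x] * basis[v][x] for x in range(n)) for v in range(keep)]
           for row in pixels]
    return [[sum(tmp[y][v] * basis[u][y] for y in range(n)) for v in range(keep)]
            for u in range(keep)]


def phash(pixels: Gray, keep: int = 8) -> int:
    """Perceptual hash: one bit per low-frequency coefficient (above median or not)."""
    coeffs = [c for row in dct_low(pixels, keep) for c in row]
    ac = sorted(coeffs[1:])           # leave the DC term out of the median
    median = ac[len(ac) // 2]
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > median else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def dedup(hashes: Dict[str, int], max_dist: int = 4) -> List[str]:
    """Keep the first key of each near-duplicate group (hypothetical threshold)."""
    kept: List[str] = []
    for key, h in hashes.items():
        if all(hamming(h, hashes[k]) > max_dist for k in kept):
            kept.append(key)
    return kept
```

The pairwise comparison in `dedup` is quadratic in the number of kept images, so it only suits small collections. At the scale of billions of images, hashing would need to be parallelised and the comparison replaced by some index or bucketing scheme, which is probably where the CPU and memory cost mentioned above comes from.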