Does https://github.com/rom1504/laion-prepro/blob/main/laion5B/safety/join.py work for non-en langs?
PranshuBansalDev opened this issue · 13 comments
Issue
Our team requires removal of all nsfw content (especially nudity)
Fix
I see here - https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/ we are pointed at this script:
https://github.com/rom1504/laion-prepro/blob/main/laion5B/safety/join.py
However, I see references to 2B rather than 5B
Question
Is the script above usable for non-en langs? Or does the script only work for en langs?
It works for all 3 sub-datasets of laion5B (laion2B-en, laion2B-multi, laion1B-nolang). You can get the tags from https://huggingface.co/datasets/laion/laion2B-multi-safety and similar links.
You may also choose to directly download the prejoined collections that already contain the safety tag: https://huggingface.co/datasets/laion/laion2B-en-joined (and similar).
We computed these tags from the CLIP image embeddings, so it works regardless of the language. You can see for yourself in https://rom1504.github.io/clip-retrieval/ that it detects (almost) all nudity (you can check/uncheck safety and search for some keywords that would usually return unsafe results).
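As a rough illustration of what joining the safety tags onto the metadata amounts to, here is a minimal pandas sketch. It is not the actual join.py: the toy rows, the helper name `join_safety`, and the choice of `hash` as the shared key column are assumptions made for illustration, so check the real parquet schemas before relying on them.

```python
import pandas as pd


def join_safety(metadata: pd.DataFrame, safety: pd.DataFrame, key: str = "hash") -> pd.DataFrame:
    """Attach safety scores to metadata rows that share the same key.

    `key` is an assumed column name; adjust it to the real schema.
    """
    # A left join keeps every metadata row; rows without a tag get NaN.
    return metadata.merge(safety, on=key, how="left", validate="one_to_one")


# Toy stand-ins for one metadata shard and one safety-tag shard.
metadata = pd.DataFrame(
    {
        "hash": [101, 102, 103],
        "URL": [
            "https://example.com/a.jpg",
            "https://example.com/b.jpg",
            "https://example.com/c.jpg",
        ],
        "TEXT": ["un chat", "eine Katze", "a cat"],
    }
)
safety = pd.DataFrame({"hash": [103, 101], "punsafe": [0.91, 0.02]})

joined = join_safety(metadata, safety)
print(joined)
```

An in-memory merge like this only shows the logic on toy data; at the scale of billions of rows it would not be practical.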
Is there any chance the laion2B-multi could have a laion2B-multi-joined? Or is that work recommended to be done by the consumers of the data?
haha you found the missing piece. Yeah indeed that last join is still running, it will be available in a few hours. The other 2 datasets are available joined already ;)
A few unrelated questions about the dataset in general (please let me know if you'd rather deal with these in a separate ticket)
- https://huggingface.co/datasets/laion/laion2B-multi - will this eventually have a "dataset preview" available?
- Is there an additional column for lang info on laion2b-multi?
- What is the value of the nolang dataset?
- What do the numbers mean w.r.t. "Number of unsafe samples with a probability threshold of 0.5: 0.033"? Does it mean that 3.3% of the data is labelled as NSFW?
- Could we have a per dataset "metadata" similar to how you had one for the 400M case?
i.e. this thing was super helpful
URL and caption metadata dataset.
We provide 32 parquet files of size around 1GB (total 50GB) with the image URLs, the associated texts and additional metadata in the following format:
SAMPLE_ID | URL | TEXT | LICENSE | NSFW | similarity | WIDTH | HEIGHT
where
SAMPLE_ID: A unique identifier
LICENSE: Where we found a Creative Commons License in the image data, we named it here like, e.g. “creativecommons.org/licenses/by-nc-sa/3.0/” – otherwise you’ll find a “?” here
NSFW: we used CLIP to estimate if the image has NSFW content. The estimation has been pretty conservative, reducing false negatives at the cost of more false positives. Possible values are “UNLIKELY”, “UNSURE” and “NSFW”.
similarity: Value of the cosine similarity between the text and image embedding
WIDTH and HEIGHT: image size as the image was embedded. We downsized originals that were larger than 4K to 4K.
The purpose of this metadata dataset is to download the images for the whole dataset or a subset of it by supplying it to the very efficient [img2dataset](https://github.com/rom1504/img2dataset) tool.
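For illustration, selecting a subset of such metadata before downloading could look like the sketch below. The column names and NSFW values come from the format description above; the toy rows, the helper name `select_rows`, and the 0.3 similarity cutoff are made up for the example.

```python
import pandas as pd

# Toy rows using the column names from the format description above.
meta = pd.DataFrame(
    {
        "SAMPLE_ID": [1, 2, 3, 4],
        "URL": [f"https://example.com/{i}.jpg" for i in range(1, 5)],
        "TEXT": ["a cat", "a beach", "a statue", "a dog"],
        "NSFW": ["UNLIKELY", "NSFW", "UNSURE", "UNLIKELY"],
        "similarity": [0.34, 0.31, 0.29, 0.27],
    }
)


def select_rows(df, allowed_tags=("UNLIKELY",), min_similarity=0.3):
    """Keep rows whose NSFW tag is allowed and whose similarity is high enough.

    The default cutoff of 0.3 is a hypothetical value, not a recommendation.
    """
    mask = df["NSFW"].isin(allowed_tags) & (df["similarity"] >= min_similarity)
    return df[mask].reset_index(drop=True)


subset = select_rows(meta)
print(subset[["SAMPLE_ID", "NSFW", "similarity"]])
```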
https://huggingface.co/datasets/laion/laion2B-multi - will this eventually have a "dataset preview" available?
I believe so, but I have no control over it, it's dependent on hf infra
Is there an additional column for lang info on laion2b-multi?
yes
What is the value of the nolang dataset?
I believe it's useful if you want to train on all languages at once. It probably contains data that is fairly unique as well. For example, names often cannot be identified as a specific language and would appear more often in laion1B
What do the numbers mean w.r.t. Number of unsafe samples with a probability threshold of 0.5: 0.033 ? Does it mean that 3.3% of the data is labelled as NSFW?
yes. Note that the classifier is a bit conservative and will classify as NSFW pictures of "sexy" (and not naked) people for example
Could we have a per dataset "metadata" similar to how you had one for the 400M case?
Do you mean a description of the fields?
It's actually the same as laion400M for the non-joined metadata; the joined metadata has punsafe and pwatermark on top
but noted, I will add that to the post
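To make the threshold question above concrete, here is a small sketch of how a statistic like 0.033 and a filtered subset could be derived from the `punsafe` column of the joined metadata. The sample values and helper names are made up, and treating `punsafe >= threshold` as "unsafe" is an assumption about how the published number was computed.

```python
import pandas as pd


def unsafe_fraction(df: pd.DataFrame, threshold: float = 0.5) -> float:
    """Share of rows whose punsafe score is at or above the threshold."""
    return float((df["punsafe"] >= threshold).mean())


def keep_safe(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Keep only rows scored below the threshold; rows without a score are dropped."""
    return df[df["punsafe"] < threshold].reset_index(drop=True)


toy = pd.DataFrame(
    {
        "URL": [f"https://example.com/{i}.jpg" for i in range(5)],
        "punsafe": [0.01, 0.20, 0.97, 0.45, 0.03],
    }
)

print(unsafe_fraction(toy))      # fraction flagged at the default 0.5 threshold
print(len(keep_safe(toy, 0.1)))  # a stricter threshold keeps fewer rows
```

Lowering the threshold flags more images, so it trades extra false positives for fewer missed ones.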
btw if you can say @PranshuBansalDev ; what are you working on? what are your plans with the datasets?
Ok multi joined is on hf too now
btw if you can say @PranshuBansalDev ; what are you working on? what are your plans with the datasets?
Sorry, I'm not able to disclose at this time :(
That's ok. I hope it works for you!
Feel free to close this one out, thank you so much!
https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/ I've added the column descriptions and more stats there