deep-learning-content-moderation
Various sources for deep learning based content moderation, sensitive content detection, scene genre classification, nudity detection, violence detection, substance detection from text, audio, video & image input modalities.
table of contents
- datasets
- techniques
- tools
datasets
movie and content moderation datasets
name | paper | year | url | input modality | task | labels |
---|---|---|---|---|---|---|
LSPD | | 2022 | page | image, video | image/video classification, instance segmentation | porn, normal, sexy, hentai, drawings, female/male genital, female breast, anus |
MM-Trailer | | 2021 | page | video | video classification | age rating |
MovieNet | scholar | 2021 | page | image, video, text | object detection, video classification | scene level actions and places, character bboxes |
Movie script severity dataset | | 2021 | github | text | text classification | frightening, mild, moderate, severe |
LVU | | 2021 | page | video | video classification | relationship, place, like ratio, view count, genre, writer, year per movie scene |
Violence detection dataset | scholar | 2020 | github | video | video classification | violent, not-violent |
Movie script dataset | | 2019 | github | text | text classification | violent or not |
Nudenet | github | 2019 | archive.org | image | image classification | nude or not |
Adult content dataset | | 2017 | contact | image | image classification | nude or not |
Substance use dataset | | 2017 | first author | image | image classification | drug related or not |
NDPI2k dataset | | 2016 | contact | video | video classification | porn or not |
Violent Scenes Dataset | springer | 2014 | page | video | video classification | blood, fire, gun, gore, fight |
VSD2014 | | 2014 | download | video | video classification | blood, fire, gun, gore, fight |
AIIA-PID4 | | 2013 | - | image | image classification | bikini, porn, skin, non-skin |
NDPI800 dataset | scholar | 2013 | page | video | video classification | porn or not |
HMDB-51 | scholar | 2011 | page | video | video classification | smoke, drink |
techniques
sensitive content detection
movie content rating
name | paper | year | model | features | datasets | tasks | context |
---|---|---|---|---|---|---|---|
Movies2Scenes: Learning Scene Representations Using Movie Similarities | scholar | 2022 | ViT-like video encoder + MLP | ViT-like video encoder embeddings | Private, MovieNet, LVU | movie scene representation learning, video classification (sex, violence, drug-use) | movie scene content rating |
Detection and Classification of Sensitive Audio-Visual Content for Automated Film Censorship and Rating | | 2022 | CNN + GRU + MLP | CNN embeddings from video frames | Violence detection dataset | violent/non-violent classification from videos | movie scene content rating |
Automatic parental guide ratings for short movies | page | 2021 | separate model for each task: concat + LSTM, object detector, one-class CNN embeddings | video frame pixel values, image embeddings, text | Nudenet, private dataset | profanity, violence, nudity, drug classification | movie content rating |
From None to Severe: Predicting Severity in Movie Scripts | scholar | 2021 | multi-task pairwise ranking-classification network | GloVe, BERT and TextCNN text embeddings | Movie script severity dataset | rating classification (frightening, mild, moderate, severe) | movie content rating |
A Case Study of Deep Learning-Based Multi-Modal Methods for Labeling the Presence of Questionable Content in Movie Trailers | scholar | 2021 | multi-modal + multi output concat+MLP | CNN+LSTM video features, BERT and DeepMoji text embeddings, MFCC audio features | MM-Trailer | rating classification (red, yellow, green) | movie trailer content rating |
Automatic Parental Guide Scene Classification Menggunakan Metode Deep Convolutional Neural Network Dan Lstm | scholar | 2020 | 3 CNN models for 3 modalities, multi-label dataset | CNN video and audio embeddings, LSTM text (subtitle) embeddings | private dataset | gore, nudity, drug, profanity classification from video and subtitle | movie scene content rating |
Multimodal data fusion for sensitive scene localization | scholar | 2019 | meta-learning with Naive Bayes, SVM | MFCC and prosodic features from audio, HOG and TRoF features from images | Pornography-2k dataset, VSD2014 | violent and pornographic scene localization from video | movie scene content rating |
A Deep Learning approach for the Motion Picture Content Rating | scholar | 2019 | MLP + rule-based decision | InceptionV3 image embeddings | Violent Scenes Dataset, private dataset | violence (shooting, blood, fire, weapon) classification from video | movie scene content rating |
Hybrid System for MPAA Ratings of Movie Clips Using Support Vector Machine | springer | 2019 | SVM | DCT features from image | private dataset | movie content rating classification from images | movie content rating |
Inappropriate scene detection in a video stream | page | 2017 | SVM classifier + LeNet image classifier + rule-based decision | HOG and CNN features for image | private dataset | image classification: no/mild/high violence, safe/unsafe/pornography | movie frame content rating |
content moderation
name | paper | year | model | features | datasets | tasks | context |
---|---|---|---|---|---|---|---|
Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild | scholar | 2022 | novel threshold optimization tech. (TruSThresh) | prediction scores | UnSmile (Korean hatespeech dataset) | optimum threshold prediction | social media content moderation |
On-Device Content Moderation | scholar | 2021 | mobilenet v3 + SSD object detector | mobilenet v3 image embeddings | private dataset | object detection + nudity classification from images | on-device content moderation |
Gore Classification and Censoring in Images | scholar | 2021 | ensemble of CNN + MLP | mobilenet v2, densenet, vgg16 image embeddings | private dataset | gore classification from images | general content moderation |
Automated Censoring of Cigarettes in Videos Using Deep Learning Techniques | scholar | 2020 | CNN + MLP | inception v3 image embeddings | private dataset | cigarette classification from video | general content moderation |
A Multimodal CNN-based Tool to Censure Inappropriate Video Scenes | scholar | 2019 | CNN + SVM | InceptionV3 image embeddings, AudioVGG audio embeddings | private dataset | inappropriate (nudity+gore) classification from video | general video content moderation |
A baseline for NSFW video detection in e-learning environments | scholar | 2019 | concat + SVM, MLP | InceptionV3 image embeddings, AudioVGG audio embeddings | YouTube8M, NDPI, Cholec80 | nudity classification from video | e-learning content moderation |
Bringing the kid back into youtube kids: Detecting inappropriate content on video streaming platforms | scholar | 2019 | CNN + LSTM (late fusion) | CNN based encoder for image, video and audio spectrograms | private dataset | video classification: original, fake explicit, fake violent | social media content moderation |
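
Several entries in the two tables above combine modalities either by concatenating features before a classifier ("concat + MLP/SVM") or by merging per-modality predictions ("late fusion"). The sketch below illustrates the difference on toy lists. It is not taken from any of the listed papers; the function names, the weighting scheme and the example values are assumptions made for illustration.

```python
def early_fusion(features_by_modality):
    """Early fusion: concatenate per-modality feature vectors into one vector.

    The fused vector would then be fed to a single classifier (MLP, SVM, ...).
    """
    fused = []
    # Sort modality names so every sample uses the same feature layout.
    for name in sorted(features_by_modality):
        fused.extend(features_by_modality[name])
    return fused


def late_fusion(scores_by_modality, weights=None):
    """Late fusion: weighted average of per-modality class scores.

    Each modality has its own classifier; only their outputs are merged.
    """
    names = sorted(scores_by_modality)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weights by default
    total = sum(weights[name] for name in names)
    num_classes = len(scores_by_modality[names[0]])
    return [
        sum(weights[n] * scores_by_modality[n][c] for n in names) / total
        for c in range(num_classes)
    ]


# Hypothetical toy inputs: a 2-d image embedding and a 1-d audio embedding.
fused_features = early_fusion({"image": [0.1, 0.9], "audio": [0.4]})

# Hypothetical per-modality scores for the classes (unsafe, safe).
fused_scores = late_fusion(
    {"image": [1.0, 0.0], "audio": [0.0, 1.0]},
    weights={"image": 3.0, "audio": 1.0},
)
print(fused_features, fused_scores)
```

Early fusion lets the classifier learn cross-modal interactions but requires all modalities at once; late fusion keeps the per-modality models independent, which is convenient when a modality (for example subtitles) is missing for some clips.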
movie/scene genre classification
name | paper | year | model | features | datasets | tasks |
---|---|---|---|---|---|---|
Effectively leveraging Multi-modal Features for Movie Genre Classification | scholar | 2022 | embeddings + fusion + MLP | CLIP image embeddings, PANNs audio embeddings, CLIP text embeddings | MovieNet | movie genre classification |
OS-MSL: One Stage Multimodal Sequential Link Framework for Scene Segmentation and Classification | scholar | 2022 | embeddings + novel transformer | ResNet-18 image embeddings, ResNet-VLAD audio embeddings | TI-News | news scene segmentation/classification (studio, outdoor, interview) |
Detection of Animated Scenes Among Movie Trailers | scholar | 2022 | CNN + GRU | EfficientNet image embeddings | Private dataset | genre classification from movie trailer scenes |
A multi-label movie genre classification scheme based on the movie's subtitles | springer | 2022 | KNN | text frequency vectors | Private dataset | genre classification from movie subtitle text |
A multimodal approach for multi-label movie genre classification | scholar | 2020 | CNN + LSTM | MFCCs/SSD/LBP from audio, LBP/3DCNN from video frames, Inception-v3 from poster, TFIDF from text | Private dataset | genre classification from movie trailers |
Genre classification of movie trailers using 3d convolutional neural networks | ieee | 2020 | 3D CNN | images | Private dataset | genre classification from movie trailer scenes |
A unified framework of deep networks for genre classification using movie trailer | scholar | 2020 | CNN + LSTM | Inception V4 image embeddings | EmoGDB | genre classification from movie trailer scenes |
Towards story-based classification of movie scenes | scholar | 2020 | logistic regression | manually extracted categorical features | Flintstones Scene Dataset | scene classification (Obstacle, Midpoint, Climax of Act 1) |
multimodal architectures
synchronous multimodal architectures
name | paper | year | model | features | datasets | tasks | modalities |
---|---|---|---|---|---|---|---|
M&M Mix: A Multimodal Multiview Transformer Ensemble | scholar | 2022 | transformer with 2 cls heads | ViT image embeddings from audio spect., frame image, optical flow | Epic-Kitchens | video/action classification | image + audio + optical flow |
MultiMAE: Multi-modal Multi-task Masked Autoencoders | scholar | 2022 | transformer with 3 decoder + cls heads | ViT-like image enc. patch embeddings (optional modalities) | ImageNet: Pseudo labeled multi-task training dataset (depth, segm) | image cls., semantic segm., depth est. | image + depth map |
Data2vec: A general framework for self-supervised learning in speech, vision and language | scholar | 2022 | single encoder | transformer based audio, text, image encoder embeddings | ImageNet, Librispeech | masked pretraining | image + audio + text |
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | scholar | 2021 | 1 encoder per modality | transformer based audio, text, image encoder embeddings | AudioSet, HowTo100M | pretraining + video/audio classification | image + audio + text |
Expanding Language-Image Pretrained Models for General Video Recognition | scholar | 2022 | 1 encoder per modality | transformer based video, text encoder embeddings | HMDB-51, UCF-101 | contrastive pretraining | video + text |
Audio-Visual Instance Discrimination with Cross-Modal Agreement | scholar | 2021 | 1 encoder per modality | CNN based audio, video encoder embeddings | HMDB-51, UCF-101 | video/audio classification | video + audio |
Robust Audio-Visual Instance Discrimination | scholar | 2021 | 1 encoder per modality | CNN based audio, video encoder embeddings | HMDB-51, UCF-101 | video/audio classification | video + audio |
Learning transferable visual models from natural language supervision | scholar | 2021 | 1 encoder per modality | transformer based image, text encoder embeddings | WIT (private, 400M image-text pairs) | contrastive pretraining | image + text |
Self-supervised multimodal versatile networks | scholar | 2020 | multiple encoders | CNN based image/audio embeddings, word2vec text embeddings | UCF101, Kinetics, AudioSet | contrastive pretraining + classification | image + audio + text |
UNITER: Universal image-text representation learning | scholar | 2020 | multimodal encoder | combined embeddings | COCO, Visual Genome, Conceptual Captions | qa/image-text retrieval | image + text |
12-in-1: Multi-task vision and language representation learning | scholar | 2020 | multimodal encoder | combined embeddings | COCO, Flickr30k | qa/image-text retrieval | image + text |
Two-stream convolutional networks for action recognition in videos | scholar | 2014 | 1 encoder per modality | CNN based RGB frame, optical flow encoder embeddings | HMDB-51, UCF-101 | video classification | video + optical flow |
asynchronous multimodal architectures
name | paper | year | model | features | datasets | tasks | modalities |
---|---|---|---|---|---|---|---|
OmniMAE: Single Model Masked Pretraining on Images and Videos | scholar | 2022 | transformer with 1 cls. head | ViT-like image/video enc. patch embeddings | ImageNet, SSv2 | video/action classification | image + video |
OMNIVORE: A Single Model for Many Visual Modalities | scholar | 2022 | transformer with 3 cls. heads | ViT-like image/video enc. patch embeddings | ImageNet, Kinetics, SSv2, SUN RGB-D | image cls., action recog., depth est. | image + video + depth map |
PolyViT: Co-training vision transformers on images, videos and audio | scholar | 2021 | transformer with 9 cls. heads | ViT-like image/video/audio enc. embeddings | ImageNet, CIFAR, Kinetics, Moments in Time, AudioSet, VGGSound | image cls., video cls., audio cls. | image + video + audio |
action recognition
with transformers
name | paper | year | model | features | datasets | tasks |
---|---|---|---|---|---|---|
Frozen CLIP Models are Efficient Video Learners | scholar | 2022 | transformer with 1 cls head | CLIP image embeddings | ImageNet, Kinetics, SSv2 | action recognition |
VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training | scholar | 2022 | transformer with 1 cls head | ViT-like video enc. patch embeddings | Kinetics, SSv2 | action recognition |
BEVT: BERT pretraining of video transformers | scholar | 2022 | encoder-decoder transformer | VideoSwin image/video enc. embeddings | Kinetics, SSv2 | action recognition |
Video Swin Transformer | scholar | 2022 | Swin transformer with cls. head | Swin video enc. embeddings | Kinetics, SSv2 | action recognition |
Is space-time attention all you need for video understanding? | scholar | 2021 | transformer with cls. head | ViT-like video enc. patch embeddings | Kinetics, SSv2 | action recognition |
with 3D CNNs
name | paper | year | model | features | datasets | tasks |
---|---|---|---|---|---|---|
X3D: Expanding architectures for efficient video recognition | scholar | 2020 | CNN with cls. head | 3D CNN based video enc. embeddings | Kinetics, SSv2 | action recognition |
SlowFast networks for video recognition | scholar | 2019 | CNN with cls. head | 3D CNN based video enc. embeddings | Kinetics, SSv2 | action recognition |
A closer look at spatiotemporal convolutions for action recognition (R(2+1)D) | scholar | 2018 | CNN with cls. head | 3D CNN based video enc. embeddings | Kinetics, HMDB-51, UCF-101 | action recognition |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (I3D) | scholar | 2017 | CNN with cls. head | 3D CNN based video enc. embeddings | Kinetics, HMDB-51, UCF-101 | action recognition |
contrastive representation learning
name | paper | date |
---|---|---|
VATT: Transformers for multimodal self-supervised learning from raw video, audio and text | scholar | 2021 |
Supervised contrastive learning | scholar | 2020 |
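
The papers listed above train encoders with contrastive objectives, which pull matching pairs of embeddings together and push non-matching ones apart. As a rough illustration of that idea, the sketch below computes a basic InfoNCE-style loss for one anchor in pure Python. It is not the exact loss of either listed paper (each uses its own variant); the function names, the cosine similarity choice and the default temperature are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log softmax of the positive among all candidates."""
    # The positive pair is placed at index 0, followed by the negatives.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]


# Hypothetical 2-d embeddings: the loss is small when the positive is
# aligned with the anchor and large when a negative is aligned instead.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned < misaligned)
```

In a multimodal setting the anchor and positive would typically come from different encoders (for example a video clip and its audio track), while negatives are drawn from other samples in the batch.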
review papers
name | paper | date |
---|---|---|
Machine Learning Models for Content Classification in Film Censorship and Rating | | 2022 |
A survey of artificial intelligence strategies for automatic detection of sexually explicit videos | scholar | 2022 |
A survey on video content rating: taxonomy, challenges and open issues | | 2021 |
Multimodal Learning with Transformers: A Survey | scholar | 2022 |
A Survey Paper on Movie Trailer Genre Detection | scholar | 2020 |
tools
name | url | description |
---|---|---|
better-profanity | github | fast swear word detection from strings |
PySceneDetect | github | Python and OpenCV-based scene cut/transition detection program & library |
LAION safety toolkit | github | NSFW detector trained on LAION dataset |
pysrt | github | Python parser for SubRip (srt) files |
ffsubsync | github | Automagically synchronize subtitles with video. |
MoviePy | github | Video editing with Python |
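
To give a feel for how subtitle parsing and word-level profanity detection fit together in a moderation pipeline, here is a minimal standard-library sketch. It does not use the APIs of pysrt or better-profanity; in practice those libraries would replace the hand-written parser and the tiny word list. `FLAGGED_WORDS`, `parse_srt`, `flag_cues` and the sample subtitle text are all hypothetical.

```python
import re

# Hypothetical, deliberately tiny word list; a real list would be far larger.
FLAGGED_WORDS = {"damn", "hell"}

# Matches a SubRip timing line such as "00:00:01,000 --> 00:00:02,500".
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(text):
    """Yield (start, end, cue_text) for each cue in a SubRip string."""
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                # Everything after the timing line is the spoken text.
                cue_text = " ".join(lines[i + 1:]).strip()
                yield match.group(1), match.group(2), cue_text
                break


def flag_cues(srt_text, words=FLAGGED_WORDS):
    """Return the cues that contain at least one flagged word."""
    flagged = []
    for start, end, cue in parse_srt(srt_text):
        tokens = set(re.findall(r"[a-z']+", cue.lower()))
        if tokens & words:
            flagged.append((start, end, cue))
    return flagged


SAMPLE = """1
00:00:01,000 --> 00:00:02,500
Where the hell are we?

2
00:00:03,000 --> 00:00:04,000
At the station.
"""

for start, end, cue in flag_cues(SAMPLE):
    print(f"{start} - {end}: {cue}")
```

The returned timestamps could then be passed to a video editing tool to mute or cut the flagged segments.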