Slides presented at MediaEval 2019, which include:
- Our approaches & the features we computed (e.g. emotions)
- Our results on our validation set and on the official test set
- Visualization techniques such as Class Activation Maps for CNNs
- Table with Results
- Ensembles: short-term table & long-term table
Paper: Predicting Media Memorability Using Ensemble Models
Please consider citing this paper if you use any of the work:
@article{azcona2019predicting,
title={Predicting Media Memorability Using Ensemble Models},
author={Azcona, David and Moreu, Enric and Hu, Feiyan and Ward, Tom{\'a}s E and Smeaton, Alan F},
year={2019},
publisher={CEUR-WS}
}
Update: We are pleased to report that we performed very well in Predicting Media Memorability at the MediaEval benchmark, improving on the state of the art with a best-in-class score of 0.528 (Spearman's rank correlation) for short-term memorability.
Team | Best Short Term | Best Long Term |
---|---|---|
Insight@DCU ⭐ | 0.528 | 0.27 |
MeMAD | 0.522 | 0.277 |
Best 2018 | 0.497 | 0.257 |
UPB-L25 (*) | 0.477 | 0.232 |
RUC | 0.472 | 0.216 |
EssexHubTV | 0.467 | 0.203 |
TCNJ-CS | 0.445 | 0.218 |
HCMUS | 0.445 | 0.208 |
GIBIS | 0.438 | 0.199 |
Baseline (MemNet) | 0.39 | 0.17 |
Average 2018 | 0.359 | 0.173 |
This repository addresses the MediaEval Predicting Media Memorability challenge.
Problem: predicting how memorable a video is to viewers, i.e. the probability that a video will be remembered.
The dataset consists of 10,000 soundless short videos extracted from raw footage used by professionals when creating content, in particular commercials. Each video has two memorability scores, short-term and long-term, reflecting the probability that it will be remembered after two different durations of memory retention.
10,000 videos: 8,000 development & 2,000 official test
Development: 7,000 our training & 1,000 our validation
Extract 8 frames per video (the first frame plus one per second).
We train individual models per set of features and then combine them using ensemble models. Our approaches include:
- Traditional Machine Learning:
  - Support Vector Regression
  - Bayesian Ridge Regression
- Deep Learning (highly regularized):
  - Embeddings for words (captions)
  - Transfer Learning with Neural Network activations as features
  - Transfer Learning by fine-tuning our own networks
- Off-the-shelf pre-computed features: C3D, HMP, LBP, InceptionV3, Color Histogram & Aesthetic visual features
- Our own pre-computed features: our aesthetics & emotions
- Textual information: bag-of-words TF-IDF with linear models, and GloVe embeddings + an RNN (GRU) with high dropout (see the caption sketch after this list)
- Pre-trained CNNs as feature extractors: transfer learning from ImageNet with VGG16, DenseNet121, ResNet50 & ResNet152 (see the feature-extraction sketch after this list)
- Fine-tuning our own CNN: a ResNet without its classification head, plus a fully-connected layer with a sigmoid output (sketched after this list)
- Ensemble models: combinations of individual models’ predictions
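As a minimal sketch of the feature-extraction approach, the snippet below uses a pre-trained ResNet50 (ImageNet weights, frozen) to turn each video's frames into activation vectors and fits a Support Vector Regressor on top. The frame-path lists, score arrays and frame averaging are illustrative placeholders, not the repo's exact pipeline (see src/train.py and src/config.py for that):

```python
# Hedged sketch: pre-trained CNN activations as features + SVR (illustrative, not the repo's exact code)
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.svm import SVR
from scipy.stats import spearmanr

# Frozen ImageNet feature extractor: global-average-pooled activations (2048-d per frame)
extractor = ResNet50(weights='imagenet', include_top=False, pooling='avg')

def video_features(frame_paths):
    """Average the CNN activations over a video's extracted frames."""
    feats = []
    for path in frame_paths:
        img = image.load_img(path, target_size=(224, 224))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        feats.append(extractor.predict(x)[0])
    return np.mean(feats, axis=0)

# train_videos / val_videos: lists of per-video frame-path lists; y_train / y_val: memorability
# scores from the ground-truth CSV. These names are hypothetical placeholders.
X_train = np.stack([video_features(v) for v in train_videos])
X_val = np.stack([video_features(v) for v in val_videos])

svr = SVR(kernel='rbf', C=1.0)
svr.fit(X_train, y_train)
print('Spearman:', spearmanr(y_val, svr.predict(X_val)).correlation)  # the task's official metric
```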
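The fine-tuning variant replaces the ImageNet classification head with a single sigmoid unit, since memorability is a score in [0, 1]. The sketch below shows the idea; the freezing depth, dropout rate and optimizer are illustrative choices rather than the exact settings used in the paper:

```python
# Hedged sketch: fine-tuning a ResNet152 as a memorability regressor (illustrative hyper-parameters)
from tensorflow.keras.applications.resnet import ResNet152
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling2D
from tensorflow.keras.models import Model

base = ResNet152(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers[:-10]:
    layer.trainable = False  # freeze most of the network, fine-tune only the last few layers

x = GlobalAveragePooling2D()(base.output)
x = Dropout(0.5)(x)                        # heavy regularization, as noted above
score = Dense(1, activation='sigmoid')(x)  # memorability score in [0, 1]

model = Model(inputs=base.input, outputs=score)
model.compile(optimizer='adam', loss='mse')
model.summary()
```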
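For the caption modality, here is a minimal sketch of the bag-of-words TF-IDF + linear model variant (Bayesian Ridge); the GloVe + GRU model follows the standard Keras pre-trained-embeddings recipe linked in the references below. The captions and scores are made up for illustration:

```python
# Hedged sketch: TF-IDF caption features + Bayesian Ridge Regression (made-up data)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import BayesianRidge

captions = ['a man lights a torch in a dark room', 'a child runs along a sandy beach']  # made up
scores = [0.93, 0.87]                                                                   # made up

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vectorizer.fit_transform(captions).toarray()  # BayesianRidge expects dense input

reg = BayesianRidge()
reg.fit(X, scores)
print(reg.predict(vectorizer.transform(['a dog runs across a field']).toarray()))
```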
Video specialized features:
- C3D (dimension: 101 features): final classification layer of the C3D model
- HMP (6075 features): histogram of motion patterns for each video
Frame features, from three key-frames (first (0), one-third (56) and two-thirds (112)) of each video (see the sketch after this list):
- HoG descriptors: histograms of oriented gradients
- LBP: local texture information
- InceptionV3: output of the fc7 layer of the InceptionV3 deep network
- ORB (An efficient alternative to SIFT or SURF): Oriented FAST and Rotated BRIEF
- Color Histogram: classic color histogram (three channels)
- Aesthetic visual features: collection of features used in the prediction of visual aesthetics, composed of color, texture and object based descriptors
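These features are provided pre-computed by the task organizers. Purely as a reference, the sketch below shows how two of them, a three-channel color histogram and ORB keypoint descriptors, could be recomputed from a key-frame with OpenCV; the frame path is hypothetical:

```python
# Hedged sketch: recomputing a color histogram and ORB descriptors for one key-frame
import cv2
import numpy as np

frame = cv2.imread('frames/video798/frame-0.jpg')  # hypothetical key-frame path

# Classic color histogram: 32 bins per BGR channel, concatenated and normalised
hist = np.concatenate([
    cv2.calcHist([frame], [c], None, [32], [0, 256]).flatten() for c in range(3)
])
hist /= hist.sum()

# ORB: Oriented FAST keypoints with Rotated BRIEF binary descriptors
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(gray, None)
print(hist.shape, len(keypoints), None if descriptors is None else descriptors.shape)
```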
At Insight@DCU, we also extracted the following:
- Our own emotions
- Our own aesthetic visual features
Our emotion features are inspired by the MediaEval 2018 paper by Duy-Tue Tran-Van et al. (HCMUS): "Predicting Media Memorability Using Deep Features and Recurrent Network".
[Figure: example frames with long-term memorability scores of 0.727 (left) and 0.273 (right)]
For each frame we extract scores for 7 emotions (anger, disgust, fear, happiness, sadness, surprise and neutral), together with gender scores and spatial information.
- Download the dataset (you may want to use an external drive) via FTP, for example:
$ wget -m --ftp-user="<user>" --ftp-password=<password> ftp://<ftp server>
Then fix and unzip the multi-part archive, for example:
$ zip --fix me18me-devset.zip --out mybigzipfile.zip
$ unzip mybigzipfile.zip
- Mount the dataset as a volume at /datasets in docker-compose.yml. As an example:
volumes:
- /Volumes/HDD/datasets/:/datasets
- Build our docker image:
$ cd docker
$ make build
- Create a docker container based on the image:
$ make run
- CLI to the docker container:
$ make dev
- Extract frames from videos (one per second, resulting in 8 frames per video for both dev and test sets):
$ python src/extract_frames.py
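The extraction logic lives in src/extract_frames.py; roughly, the idea is to grab the first frame and then one frame per second, e.g. with OpenCV as in the sketch below (paths and file naming here are assumptions):

```python
# Hedged sketch of per-second frame extraction (the real logic is in src/extract_frames.py)
import os
import cv2

def extract_frames(video_path, out_dir, frames_per_video=8):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25  # fall back if the FPS metadata is missing
    for i in range(frames_per_video):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * fps))  # first frame, then one per second
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, 'frame-{}.jpg'.format(i)), frame)
    cap.release()

extract_frames('/datasets/dev-set/videos/video798.webm', 'frames/video798')  # hypothetical paths
```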
- Extract emotion features from frames (run this command in a separate repo, see instructions here):
$ python src/extract_emotions.py
- Modify the file src/config.py to run the desired experiment.
- Run the training:
$ python src/train.py
Running the train script for one or more features creates the corresponding predictions, stored in predictions/training/.
- Find the best ensemble models. You can manually apply weights to each desired model's predictions or run the automated search for the best weights (by creating bins):
$ python src/ensemble_manual.py
$ python src/ensemble_auto.py
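The sketch below illustrates the ensembling idea behind these scripts: per-model predictions are blended with a weighted average, and the automated search tries weight combinations on a coarse grid ("bins") to maximize Spearman correlation on the validation set. Model names, file names and column names are assumptions, not the repo's actual layout:

```python
# Hedged sketch of weighted-average ensembling with a grid search over weight bins
import itertools
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical model names, file layout and column names; rows assumed aligned by video
preds = {name: pd.read_csv('predictions/training/{}.csv'.format(name))['short_term'].values
         for name in ['captions_gru', 'resnet152', 'c3d_svr']}
y_val = pd.read_csv('predictions/training/ground_truth.csv')['short_term'].values

bins = np.linspace(0, 1, 11)  # candidate weights in steps of 0.1
best_weights, best_rho = None, -1.0
for w in itertools.product(bins, repeat=len(preds)):
    if not np.isclose(sum(w), 1.0):
        continue  # only keep weight combinations that sum to 1
    blended = sum(wi * p for wi, p in zip(w, preds.values()))
    rho = spearmanr(y_val, blended).correlation
    if rho > best_rho:
        best_weights, best_rho = w, rho
print('Best weights {} with Spearman {:.3f}'.format(best_weights, best_rho))
```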
- Run the test:
$ python src/test.py
When running the test script, models are trained on all of the available development data (8,000 videos) and the predictions for the test set are stored in predictions/test/.
- Run submission generation:
$ python src/submit.py
The ensemble weights are manually defined and the CSV submissions are generated for 5 runs for each subtask: short-term and long-term memorability scores.
- [Optional] Visualizing heatmaps of class activation:
$ python src/viz_activations.py --model ResNet152
- Deep CNN models typically outperform models trained on captions and other visual features for short-term memorability; however, techniques such as word embeddings and RNNs can achieve very strong results with captions.
- We believe fine-tuned CNN models would outperform pre-trained models used as feature extractors given enough training samples (not proven in this paper).
- Ensembling models by combining their predictions, rather than training a single model on very long concatenated feature vectors, is the alternative we used to work around memory limitations.
- Ensembling models across modalities (e.g. emotions, captions, high-level CNN representations and pre-computed visual features) achieves the best results, as these modalities capture different high-level abstractions.
A ResNet152 model trained on ImageNet was applied to one of the video frames (frame 48) of the most memorable videos (short-term and long-term). This is very useful for understanding which parts of a given image led the pre-trained CNN to its ImageNet classification. The technique, class activation map (CAM) visualization, produces heatmaps of class activation over input images. For further details see François Chollet's Deep Learning with Python book.
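Purely as a reference, here is a minimal Grad-CAM-style sketch assuming TensorFlow 2.x Keras and a ResNet152 with ImageNet weights; it is not necessarily the exact implementation in src/viz_activations.py, and the frame path is hypothetical:

```python
# Hedged Grad-CAM sketch: heatmap of the top predicted class over the last conv feature map
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet import ResNet152, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

model = ResNet152(weights='imagenet')
img = image.load_img('frames/video798/frame-48.jpg', target_size=(224, 224))  # hypothetical path
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# Map the input image to both the last conv-layer activations and the class predictions
last_conv = model.get_layer('conv5_block3_out')
grad_model = tf.keras.Model(model.inputs, [last_conv.output, model.output])

with tf.GradientTape() as tape:
    conv_out, preds = grad_model(x)
    top_class = int(tf.argmax(preds[0]))
    top_score = preds[:, top_class]

# Channel weights = average gradient of the top class score w.r.t. each feature-map channel
grads = tape.gradient(top_score, conv_out)
weights = tf.reduce_mean(grads, axis=(0, 1, 2))
cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)
cam = tf.nn.relu(cam)
cam = cam / (tf.reduce_max(cam) + 1e-8)  # normalised heatmap, to be resized over the frame

print(decode_predictions(preds.numpy(), top=4)[0])
```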
- video798.webm. The top-4 classes predicted for this video frame are as follows:
- 'torch': 0.23151287 (with 23.15% probability)
- 'hatchet': 0.094463184 (with 9.45% probability)
- 'crutch': 0.0654099 (with 6.54% probability)
- 'pedestal': 0.06340647 (with 6.34% probability)
- video1981.webm:
- 'bow_tie': 0.99436283
- 'torch': 0.0010983162
- 'theater_curtain': 0.00067173946
- 'feather_boa': 0.0004574099
- 'groom': 0.00034087678
- video4903.webm:
- 'television': 0.5428618
- 'desktop_computer': 0.115691125
- 'screen': 0.11060062
- 'laptop': 0.06419162
- 'monitor': 0.05998577
- 'notebook': 0.040473375
- video9496.webm:
- 'sandbar': 0.55648345
- 'seashore': 0.13317421
- 'lakeside': 0.03515112
- 'wreck': 0.028257731
- 'volcano': 0.017195351
- video6103.webm:
- 'fur_coat': 0.66497004
- 'cloak': 0.16292651
- 'ski_mask': 0.024773473
- 'lab_coat': 0.016840363
- video5186.webm:
- 'mountain_bike': 0.8176742
- 'bicycle-built-for-two': 0.1651485
- 'unicycle': 0.009558631
- 'alp': 0.0027272117
- video4798.webm:
- 'jean': 0.64808583
- 'cash_machine': 0.06661992
- 'trench_coat': 0.026500706
- 'wardrobe': 0.026173087
- 'prison': 0.025266951
- video480.webm:
- 'giant_schnauzer': 0.28221375
- 'cocker_spaniel': 0.172711
- 'Scotch_terrier': 0.11454323
- 'Great_Dane': 0.045542818
- 'Lakeland_terrier': 0.033769395
- 'standard_schnauzer': 0.030899713
- video7606.webm:
- 'chain_saw': 0.15715672
- 'pole': 0.099422
- 'hook': 0.064023055
- 'paintbrush': 0.04958201
- 'shovel': 0.031757597
- video4809.webm:
- 'racket': 0.9964013
- 'tennis_ball': 0.0032226138
- 'ping-pong_ball': 0.00037128705
- MediaEval 2018: http://multimediaeval.org/mediaeval2018/memorability/index.html
- Presentation at MediaEval 2018 - Predicting Media Memorability: https://www.slideshare.net/multimediaeval/mediaeval-2018-predicting-media-memorability
- Proceedings of the MediaEval 2018 Workshop: http://ceur-ws.org/Vol-2283/
- Keras & Regression: https://www.pyimagesearch.com/2019/01/21/regression-with-keras/
- Keras custom metrics: https://machinelearningmastery.com/custom-metrics-deep-learning-keras-python/
- Stanford's GloVe: https://nlp.stanford.edu/projects/glove/
- Pre-trained word embeddings: https://github.com/keras-team/keras/blob/master/examples/pretrained_word_embeddings.py
- Using pre-trained word embeddings in a Keras model: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
- Word embedding & sentiment classification using Keras: https://towardsdatascience.com/machine-learning-word-embedding-sentiment-classification-using-keras-b83c28087456
- Custom image augmentation with Keras: https://medium.com/the-artificial-impostor/custom-image-augmentation-with-keras-70595b01aeac
- How to use Keras fit and fit_generator (a hands-on tutorial): https://www.pyimagesearch.com/2018/12/24/how-to-use-keras-fit-and-fit_generator-a-hands-on-tutorial/
- Keras Sequence utility: https://keras.io/utils/#sequence
- Keras: training on large datasets: https://medium.com/datadriveninvestor/keras-training-on-large-datasets-3e9d9dbc09d4
- How to Train a Final Machine Learning Model: https://machinelearningmastery.com/train-final-machine-learning-model/
- Five video classification methods implemented in Keras and TensorFlow: https://blog.coast.ai/five-video-classification-methods-implemented-in-keras-and-tensorflow-99cad29cc0b5