Slides presented at MediaEval 2019, which include:
- Our approaches & the features we computed (e.g. emotions)
- Our results on our validation set and on the official test set
- Visualization techniques such as Class Activation Maps for CNNs
- Table with Results
- Ensembles: short-term table & long-term table
Paper: Predicting Media Memorability Using Ensemble Models
Please consider citing this paper if you use any of the work:
@article{azcona2019predicting,
title={Predicting Media Memorability Using Ensemble Models},
author={Azcona, David and Moreu, Enric and Hu, Feiyan and Ward, Tom{\'a}s E and Smeaton, Alan F},
year={2019},
publisher={CEUR-WS}
}
Update: We are pleased to report that we performed very well in Predicting Media Memorability at the MediaEval benchmark, improving on the state of the art with a best-in-class score of 0.528 (Spearman's rank correlation) for short-term memorability.
Team | Best Short Term | Best Long Term |
---|---|---|
Insight@DCU ⭐ | 0.528 | 0.27 |
MeMAD | 0.522 | 0.277 |
Best 2018 | 0.497 | 0.257 |
UPB-L25 (*) | 0.477 | 0.232 |
RUC | 0.472 | 0.216 |
EssexHubTV | 0.467 | 0.203 |
TCNJ-CS | 0.445 | 0.218 |
HCMUS | 0.445 | 0.208 |
GIBIS | 0.438 | 0.199 |
Baseline (MemNet) | 0.39 | 0.17 |
Average 2018 | 0.359 | 0.173 |
This repository addresses the MediaEval Predicting Media Memorability challenge.
Problem: predicting how memorable a video is to viewers, i.e. the probability that a video will be remembered.
The dataset consists of 10,000 soundless short videos extracted from raw footage used by professionals when creating content, in particular commercials. Each video has two memorability scores, short-term and long-term, reflecting the probability that it will be remembered after two different durations of memory retention.
10,000 videos: 8,000 development & 2,000 official test
Development: 7,000 our training & 1,000 our validation
Extract 8 frames per video (the first frame plus one per second).
We train individual models per set of features and then combine them using ensemble models. Our approaches include:
- Traditional Machine Learning:
  - Support Vector Regression
  - Bayesian Ridge Regression
- Deep Learning (highly regularized):
  - Embeddings for words (captions)
  - Transfer Learning with Neural Network activations as features
  - Transfer Learning by fine-tuning our own networks
- Off-the-shelf pre-computed features: C3D, HMP, LBP, InceptionV3, Color Histogram & Aesthetic visual features
- Our own pre-computed features: our aesthetics & emotions
- Textual information: bag-of-words TF-IDF with linear models, and GloVe embeddings + an RNN (GRU) with high dropout (see the caption sketch after this list)
- Pre-trained CNNs as feature extractors: transfer learning from ImageNet with VGG16, DenseNet121, ResNet50 & ResNet152 (see the feature-extraction sketch after this list)
- Fine-tuning our own CNN: a ResNet without its classification head, plus a fully-connected layer with a sigmoid output (sketched after this list)
- Ensemble models: combinations of individual models’ predictions
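As a minimal sketch of the feature-extraction approach, the snippet below uses a pre-trained ResNet50 (ImageNet weights, frozen) to turn each video's frames into activation vectors and fits a Support Vector Regressor on top. The frame-path lists, score arrays and frame averaging are illustrative placeholders, not the repo's exact pipeline (see src/train.py and src/config.py for that):

```python
# Hedged sketch: pre-trained CNN activations as features + SVR (illustrative, not the repo's exact code)
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.svm import SVR
from scipy.stats import spearmanr

# Frozen ImageNet feature extractor: global-average-pooled activations (2048-d per frame)
extractor = ResNet50(weights='imagenet', include_top=False, pooling='avg')

def video_features(frame_paths):
    """Average the CNN activations over a video's extracted frames."""
    feats = []
    for path in frame_paths:
        img = image.load_img(path, target_size=(224, 224))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        feats.append(extractor.predict(x)[0])
    return np.mean(feats, axis=0)

# train_videos / val_videos: lists of per-video frame-path lists; y_train / y_val: memorability
# scores from the ground-truth CSV. These names are hypothetical placeholders.
X_train = np.stack([video_features(v) for v in train_videos])
X_val = np.stack([video_features(v) for v in val_videos])

svr = SVR(kernel='rbf', C=1.0)
svr.fit(X_train, y_train)
print('Spearman:', spearmanr(y_val, svr.predict(X_val)).correlation)  # the task's official metric
```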
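The fine-tuning variant replaces the ImageNet classification head with a single sigmoid unit, since memorability is a score in [0, 1]. The sketch below shows the idea; the freezing depth, dropout rate and optimizer are illustrative choices rather than the exact settings used in the paper:

```python
# Hedged sketch: fine-tuning a ResNet152 as a memorability regressor (illustrative hyper-parameters)
from tensorflow.keras.applications.resnet import ResNet152
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling2D
from tensorflow.keras.models import Model

base = ResNet152(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers[:-10]:
    layer.trainable = False  # freeze most of the network, fine-tune only the last few layers

x = GlobalAveragePooling2D()(base.output)
x = Dropout(0.5)(x)                        # heavy regularization, as noted above
score = Dense(1, activation='sigmoid')(x)  # memorability score in [0, 1]

model = Model(inputs=base.input, outputs=score)
model.compile(optimizer='adam', loss='mse')
model.summary()
```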
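For the caption modality, here is a minimal sketch of the bag-of-words TF-IDF + linear model variant (Bayesian Ridge); the GloVe + GRU model follows the standard Keras pre-trained-embeddings recipe linked in the references below. The captions and scores are made up for illustration:

```python
# Hedged sketch: TF-IDF caption features + Bayesian Ridge Regression (made-up data)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import BayesianRidge

captions = ['a man lights a torch in a dark room', 'a child runs along a sandy beach']  # made up
scores = [0.93, 0.87]                                                                   # made up

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vectorizer.fit_transform(captions).toarray()  # BayesianRidge expects dense input

reg = BayesianRidge()
reg.fit(X, scores)
print(reg.predict(vectorizer.transform(['a dog runs across a field']).toarray()))
```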
Video specialized features:
- C3D (dimension: 101 features): final classification layer of the C3D model
- HMP (6075 features): histogram of motion patterns for each video
Frame features, from three key-frames (first (0), one-third (56) and two-thirds (112)) of each video (see the sketch after this list):
- HoG descriptors: histograms of oriented gradients
- LBP: local texture information
- InceptionV3: output of the fc7 layer of the InceptionV3 deep network
- ORB (An efficient alternative to SIFT or SURF): Oriented FAST and Rotated BRIEF
- Color Histogram: classic color histogram (three channels)
- Aesthetic visual features: collection of features used in the prediction of visual aesthetics, composed of color, texture and object based descriptors
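These features are provided pre-computed by the task organizers. Purely as a reference, the sketch below shows how two of them, a three-channel color histogram and ORB keypoint descriptors, could be recomputed from a key-frame with OpenCV; the frame path is hypothetical:

```python
# Hedged sketch: recomputing a color histogram and ORB descriptors for one key-frame
import cv2
import numpy as np

frame = cv2.imread('frames/video798/frame-0.jpg')  # hypothetical key-frame path

# Classic color histogram: 32 bins per BGR channel, concatenated and normalised
hist = np.concatenate([
    cv2.calcHist([frame], [c], None, [32], [0, 256]).flatten() for c in range(3)
])
hist /= hist.sum()

# ORB: Oriented FAST keypoints with Rotated BRIEF binary descriptors
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(gray, None)
print(hist.shape, len(keypoints), None if descriptors is None else descriptors.shape)
```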
At Insight@DCU, we also extracted the following:
- Our own emotions
- Our own aesthetic visual features
Our emotion features are inspired by the MediaEval 2018 paper by Duy-Tue Tran-Van et al. (HCMUS): "Predicting Media Memorability Using Deep Features and Recurrent Network".
[Figure: example frames with long-term memorability scores of 0.727 (left) and 0.273 (right)]
For each frame we extract scores for 7 emotions (anger, disgust, fear, happiness, sadness, surprise and neutral), together with gender scores and spatial information.
- Download the dataset (you may want to use an external drive) via FTP, for example:
$ wget -m --ftp-user="<user>" --ftp-password=<password> ftp://<ftp server>
Then fix and unzip the multi-part archive, for example:
$ zip --fix me18me-devset.zip --out mybigzipfile.zip
$ unzip mybigzipfile.zip
- Mount the dataset as a volume at /datasets in docker-compose.yml. As an example:
volumes:
- /Volumes/HDD/datasets/:/datasets
- Build our docker image:
$ cd docker
$ make build
- Create a docker container based on the image:
$ make run
- CLI to the docker container:
$ make dev
- Extract frames from videos (one per second, resulting in 8 frames per video for both dev and test sets):
$ python src/extract_frames.py
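The extraction logic lives in src/extract_frames.py; roughly, the idea is to grab the first frame and then one frame per second, e.g. with OpenCV as in the sketch below (paths and file naming here are assumptions):

```python
# Hedged sketch of per-second frame extraction (the real logic is in src/extract_frames.py)
import os
import cv2

def extract_frames(video_path, out_dir, frames_per_video=8):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25  # fall back if the FPS metadata is missing
    for i in range(frames_per_video):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * fps))  # first frame, then one per second
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, 'frame-{}.jpg'.format(i)), frame)
    cap.release()

extract_frames('/datasets/dev-set/videos/video798.webm', 'frames/video798')  # hypothetical paths
```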
- Extract emotion features from frames (run this command in a separate repo, see instructions here):
$ python src/extract_emotions.py
- Modify the file src/config.py to run the desired experiment.
- Run the training:
$ python src/train.py
Running the train script for one or more features creates the corresponding predictions, stored in predictions/training/.
- Find the best ensemble models. You can manually apply weights to each desired model's predictions or run the automated search for the best weights (by creating bins):
$ python src/ensemble_manual.py
$ python src/ensemble_auto.py
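The sketch below illustrates the ensembling idea behind these scripts: per-model predictions are blended with a weighted average, and the automated search tries weight combinations on a coarse grid ("bins") to maximize Spearman correlation on the validation set. Model names, file names and column names are assumptions, not the repo's actual layout:

```python
# Hedged sketch of weighted-average ensembling with a grid search over weight bins
import itertools
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical model names, file layout and column names; rows assumed aligned by video
preds = {name: pd.read_csv('predictions/training/{}.csv'.format(name))['short_term'].values
         for name in ['captions_gru', 'resnet152', 'c3d_svr']}
y_val = pd.read_csv('predictions/training/ground_truth.csv')['short_term'].values

bins = np.linspace(0, 1, 11)  # candidate weights in steps of 0.1
best_weights, best_rho = None, -1.0
for w in itertools.product(bins, repeat=len(preds)):
    if not np.isclose(sum(w), 1.0):
        continue  # only keep weight combinations that sum to 1
    blended = sum(wi * p for wi, p in zip(w, preds.values()))
    rho = spearmanr(y_val, blended).correlation
    if rho > best_rho:
        best_weights, best_rho = w, rho
print('Best weights {} with Spearman {:.3f}'.format(best_weights, best_rho))
```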
- Run the test:
$ python src/test.py
When running the test script, models are trained on all of the available development data (8,000 videos) and the predictions for the test set are stored in predictions/test/.
- Run submission generation:
$ python src/submit.py
The ensemble weights are manually defined and the CSV submissions are generated for 5 runs for each subtask: short-term and long-term memorability scores.
- [Optional] Visualizing heatmaps of class activation:
$ python src/viz_activations.py --model ResNet152
- Deep CNN models typically outperform models trained on captions and other visual features for short-term memorability; however, techniques such as word embeddings and RNNs can achieve very strong results with captions.
- We believe fine-tuned CNN models would outperform pre-trained models used as feature extractors given enough training samples (not proven in this paper).
- Ensembling models by combining their predictions, rather than training a single model on very long concatenated feature vectors, is the alternative we used to work around memory limitations.
- Ensembling models across modalities (e.g. emotions, captions, high-level CNN representations and pre-computed visual features) achieves the best results, as these modalities capture different high-level abstractions.
A ResNet152 model trained on ImageNet was applied to one of the video frames (frame 48) of the most memorable videos (short-term and long-term). This is very useful for understanding which parts of a given image led the pre-trained CNN to its ImageNet classification. The technique, class activation map (CAM) visualization, produces heatmaps of class activation over input images. For further details see François Chollet's Deep Learning with Python book.
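Purely as a reference, here is a minimal Grad-CAM-style sketch assuming TensorFlow 2.x Keras and a ResNet152 with ImageNet weights; it is not necessarily the exact implementation in src/viz_activations.py, and the frame path is hypothetical:

```python
# Hedged Grad-CAM sketch: heatmap of the top predicted class over the last conv feature map
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet import ResNet152, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

model = ResNet152(weights='imagenet')
img = image.load_img('frames/video798/frame-48.jpg', target_size=(224, 224))  # hypothetical path
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# Map the input image to both the last conv-layer activations and the class predictions
last_conv = model.get_layer('conv5_block3_out')
grad_model = tf.keras.Model(model.inputs, [last_conv.output, model.output])

with tf.GradientTape() as tape:
    conv_out, preds = grad_model(x)
    top_class = int(tf.argmax(preds[0]))
    top_score = preds[:, top_class]

# Channel weights = average gradient of the top class score w.r.t. each feature-map channel
grads = tape.gradient(top_score, conv_out)
weights = tf.reduce_mean(grads, axis=(0, 1, 2))
cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)
cam = tf.nn.relu(cam)
cam = cam / (tf.reduce_max(cam) + 1e-8)  # normalised heatmap, to be resized over the frame

print(decode_predictions(preds.numpy(), top=4)[0])
```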
- video798.webm. The top-4 classes predicted for this video frame are as follows:
- 'torch': 0.23151287 (with 23.15% probability)
- 'hatchet': 0.094463184 (with 9.45% probability)
- 'crutch': 0.0654099 (with 6.54% probability)
- 'pedestal': 0.06340647 (with 6.34% probability)
- video1981.webm:
- 'bow_tie': 0.99436283
- 'torch': 0.0010983162
- 'theater_curtain': 0.00067173946
- 'feather_boa': 0.0004574099
- 'groom': 0.00034087678
- video4903.webm:
- 'television': 0.5428618
- 'desktop_computer': 0.115691125
- 'screen': 0.11060062
- 'laptop': 0.06419162
- 'monitor': 0.05998577
- 'notebook': 0.040473375
- video9496.webm:
- 'sandbar': 0.55648345
- 'seashore': 0.13317421
- 'lakeside': 0.03515112
- 'wreck': 0.028257731
- 'volcano': 0.017195351
- video6103.webm:
- 'fur_coat': 0.66497004
- 'cloak': 0.16292651
- 'ski_mask': 0.024773473
- 'lab_coat': 0.016840363
- video5186.webm:
- 'mountain_bike': 0.8176742
- 'bicycle-built-for-two': 0.1651485
- 'unicycle': 0.009558631
- 'alp': 0.0027272117
- video4798.webm:
- 'jean': 0.64808583
- 'cash_machine': 0.06661992
- 'trench_coat': 0.026500706
- 'wardrobe': 0.026173087
- 'prison': 0.025266951
- video480.webm:
- 'giant_schnauzer': 0.28221375
- 'cocker_spaniel': 0.172711
- 'Scotch_terrier': 0.11454323
- 'Great_Dane': 0.045542818
- 'Lakeland_terrier': 0.033769395
- 'standard_schnauzer': 0.030899713
- video7606.webm:
- 'chain_saw': 0.15715672
- 'pole': 0.099422
- 'hook': 0.064023055
- 'paintbrush': 0.04958201
- 'shovel': 0.031757597
- video4809.webm:
- 'racket': 0.9964013
- 'tennis_ball': 0.0032226138
- 'ping-pong_ball': 0.00037128705
- MediaEval 2018: http://multimediaeval.org/mediaeval2018/memorability/index.html
- Presentation at MediaEval 2018 - Predicting Media Memorability: https://www.slideshare.net/multimediaeval/mediaeval-2018-predicting-media-memorability
- Proceedings of the MediaEval 2018 Workshop: http://ceur-ws.org/Vol-2283/
- Keras & Regression: https://www.pyimagesearch.com/2019/01/21/regression-with-keras/
- Keras custom metrics: https://machinelearningmastery.com/custom-metrics-deep-learning-keras-python/
- Stanford's GloVe: https://nlp.stanford.edu/projects/glove/
- Pre-trained word embeddings: https://github.com/keras-team/keras/blob/master/examples/pretrained_word_embeddings.py
- Using pre-trained word embeddings in a Keras model: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
- Word embedding & sentiment classification using Keras: https://towardsdatascience.com/machine-learning-word-embedding-sentiment-classification-using-keras-b83c28087456
- Custom image augmentation with Keras: https://medium.com/the-artificial-impostor/custom-image-augmentation-with-keras-70595b01aeac
- How to use Keras fit and fit_generator (a hands-on tutorial): https://www.pyimagesearch.com/2018/12/24/how-to-use-keras-fit-and-fit_generator-a-hands-on-tutorial/
- Keras Sequence utility: https://keras.io/utils/#sequence
- Keras: training on large datasets: https://medium.com/datadriveninvestor/keras-training-on-large-datasets-3e9d9dbc09d4
- How to Train a Final Machine Learning Model: https://machinelearningmastery.com/train-final-machine-learning-model/
- Five video classification methods implemented in Keras and TensorFlow: https://blog.coast.ai/five-video-classification-methods-implemented-in-keras-and-tensorflow-99cad29cc0b5