memorability

🎥 Winner in the Memorability Challenge at MediaEval (workshop at MMM 2019)


Insight@DCU in the Memorability Challenge at MediaEval2019

Slides at MediaEval2019 that include:

  • Our approaches & the features we computed (e.g. emotions)
  • Our results on our validation set and on the official test set
  • Visualization techniques such as Class Activation Maps for CNNs
  • Table with Results
  • Ensembles: short-term table & long-term table

Paper: Predicting Media Memorability Using Ensemble Models


Please consider citing this paper if you use any of the work:

@article{azcona2019predicting,
  title={Predicting Media Memorability Using Ensemble Models},
  author={Azcona, David and Moreu, Enric and Hu, Feiyan and Ward, Tom{\'a}s E and Smeaton, Alan F},
  year={2019},
  publisher={CEUR-WS}
}

Update: We are pleased to report that we performed very well in the Predicting Media Memorability task at the MediaEval Multimedia Evaluation benchmark, improving on the state of the art with a best-in-class score of 0.528 for short-term memorability.

Team              | Best Short Term | Best Long Term
------------------|-----------------|---------------
Insight@DCU ⭐     | 0.528           | 0.270
MeMAD             | 0.522           | 0.277
Best 2018         | 0.497           | 0.257
UPB-L25 (*)       | 0.477           | 0.232
RUC               | 0.472           | 0.216
EssexHubTV        | 0.467           | 0.203
TCNJ-CS           | 0.445           | 0.218
HCMUS             | 0.445           | 0.208
GIBIS             | 0.438           | 0.199
Baseline (MemNet) | 0.390           | 0.170
Average 2018      | 0.359           | 0.173

Challenge

The Memorability challenge is here.

Problem: predicting how memorable a video is to viewers, i.e. the probability that a video will be remembered.

Dataset

10,000 soundless short videos extracted from raw footage used by professionals when creating content, in particular commercials. Each video has two memorability scores, short-term and long-term, which reflect the probability that the video will be remembered after two different durations of memory retention.

10,000 videos: 8,000 development & 2,000 official test

Our approach

Development set: 7,000 videos for our training & 1,000 for our validation

Extract 8 frames per video (first + one per second)
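
A minimal sketch of this frame-sampling step (illustrative only; see src/extract_frames.py for the actual implementation, OpenCV and the example path here are assumptions):

import cv2

def sample_frames(video_path, n_frames=8):
    """Grab the first frame plus one frame per second."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back to 25 fps if metadata is missing
    frames = []
    for i in range(n_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * fps))  # jump roughly one second per iteration
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames

frames = sample_frames("video798.webm")  # hypothetical local path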

We train individual models per set of features and then combine them into ensembles (see the sketch after this list), using:

  • Traditional Machine Learning:
    • Support Vector Regression
    • Bayesian Ridge Regression
  • Deep Learning (highly regularized):
    • Embeddings for words (captions)
    • Transfer Learning w/ Neural Network activations as features
    • Transfer Learning by fine-tuning our own networks
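
As an illustration of the traditional machine-learning branch, a minimal sketch that fits one regressor per pre-computed feature set (the feature file names below are placeholders, not the repository's actual layout):

import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import SVR
from sklearn.linear_model import BayesianRidge

# Placeholder arrays: one pre-computed feature set and the short-term ground truth
X_train, y_train = np.load("c3d_train.npy"), np.load("short_term_train.npy")
X_val, y_val = np.load("c3d_val.npy"), np.load("short_term_val.npy")

for model in (SVR(kernel="rbf", C=1.0), BayesianRidge()):
    model.fit(X_train, y_train)
    rho, _ = spearmanr(y_val, model.predict(X_val))  # Spearman's rank correlation, the task metric
    print(type(model).__name__, round(rho, 3))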

Our models:

  • Off-the-shelf pre-computed features: C3D, HMP, LBP, InceptionV3, Color Histogram & Aesthetic
  • Our own pre-computed features: Our Aesthetics & Emotions
  • Textual information: bag-of-words TF-IDF with linear models & GloVe embeddings + GRU RNN with high dropout (see the sketch after this list)
  • Pre-trained CNNs as feature extractors: transfer learning with ImageNet: VGG16, DenseNet121, ResNet50 & ResNet152
  • Fine-tuning our own CNN: ResNet without its classification head + fully connected layer + sigmoid output
  • Ensemble models: combinations of individual models’ predictions
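
A minimal sketch of the caption model (GloVe-initialised embeddings + GRU with heavy dropout); the vocabulary size, dimensions and dropout rates below are placeholder hyperparameters, not the exact configuration we used:

import tensorflow as tf

VOCAB_SIZE, EMB_DIM = 5000, 100  # placeholder vocabulary size; 100-d to match GloVe-100 vectors

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM),  # weights would be initialised from GloVe in practice
    tf.keras.layers.GRU(64, dropout=0.5, recurrent_dropout=0.5),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # memorability score in [0, 1]
])
model.compile(optimizer="adam", loss="mse")
model.summary()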

Pre-computed Features

Video specialized features:

  • C3D (dimension: 101 features): final classification layer of the C3D model
  • HMP (6075 features): histogram of motion patterns for each video

Frame features, extracted from three key-frames (the first (0), one-third (56) and two-thirds (112)) of each video (two of these are illustrated in the sketch after this list):

  • HoG descriptors: histograms of oriented gradients
  • LBP: local texture information
  • InceptionV3: output of the fc7 layer of the InceptionV3 deep network
  • ORB: Oriented FAST and Rotated BRIEF (an efficient alternative to SIFT or SURF)
  • Color Histogram: classic color histogram (three channels)
  • Aesthetic visual features: collection of features used in the prediction of visual aesthetics, composed of color, texture and object based descriptors
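
For reference, two of these frame descriptors can be recomputed from a key-frame roughly as follows (illustrative only; the official pre-computed features are provided with the dataset, and the file path and parameters below are assumptions):

import cv2
import numpy as np
from skimage.feature import hog, local_binary_pattern

frame = cv2.imread("video798-frame-0.jpg")  # hypothetical key-frame path
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Classic colour histogram over the three channels
color_hist = np.concatenate(
    [cv2.calcHist([frame], [c], None, [32], [0, 256]).ravel() for c in range(3)]
)
# Histogram of oriented gradients and local binary patterns (texture)
hog_desc = hog(gray, orientations=9, pixels_per_cell=(16, 16), cells_per_block=(2, 2))
lbp_map = local_binary_pattern(gray, P=8, R=1.0)  # usually summarised as a histogram afterwards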

At Insight@DCU, we also extracted the following:

  • Our own emotions
  • Our own aesthetic visual features

Why emotions?

MediaEval 2018: Duy-Tue Tran-Van et al. (HCMUS), "Predicting Media Memorability Using Deep Features and Recurrent Network"

Example frames with long-term memorability scores: 0.727 (left) and 0.273 (right)

Our pre-computed Emotion features:

7 emotions (anger, disgust, fear, happiness, sadness, surprise and neutral), plus gender scores & spatial information
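
A hedged sketch of how per-face outputs from an emotion recogniser could be aggregated into a fixed-length per-video feature vector (the detection format below is hypothetical, not the exact output of the emotion repo we used):

import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def video_emotion_features(faces):
    # faces: list of dicts like {"emotions": {...}, "gender": 0.8, "box": (x, y, w, h)} (assumed format)
    if not faces:
        return np.zeros(len(EMOTIONS) + 2)
    probs = np.array([[f["emotions"][e] for e in EMOTIONS] for f in faces])
    genders = np.array([f["gender"] for f in faces])
    areas = np.array([f["box"][2] * f["box"][3] for f in faces])  # face area as simple spatial information
    # Mean emotion probabilities across detected faces, plus mean gender score and mean face area
    return np.concatenate([probs.mean(axis=0), [genders.mean()], [areas.mean()]])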

Technologies used in our work

Our Deployment

  1. Download the dataset (you may want to use an external drive) via FTP, for example:
$ wget -m --ftp-user="<user>" --ftp-password=<password> ftp://<ftp server>

and repair and unzip the multi-part zip files, for example:

$ zip --fix me18me-devset.zip --out mybigzipfile.zip
$ unzip mybigzipfile.zip
  2. Mount the dataset as a drive at /datasets in docker-compose.yml. As an example:
volumes:
  - /Volumes/HDD/datasets/:/datasets
  3. Build our Docker image:
$ cd docker
$ make build
  4. Create a Docker container based on the image:
$ make run
  5. Get a CLI into the Docker container:
$ make dev
  6. Extract frames from videos (one per second, resulting in 8 frames per video for both dev and test sets):
$ python src/extract_frames.py
  7. Extract emotion features from frames (run this command in a separate repo, see instructions here):
$ python src/extract_emotions.py
  8. Modify the file src/config.py to run the desired experiment.

  9. Run the training:

$ python src/train.py

Running the train script for one or several features creates the corresponding predictions, stored in predictions/training/.

  10. Find the best ensemble models. You can manually apply weights to each desired model's predictions or run the automated search for the best weights (by creating bins):
$ python src/ensemble_manual.py
$ python src/ensemble_auto.py
  11. Run the test:
$ python src/test.py

After running the test script, the models are trained on all the available training data (8,000 videos) and the predictions for the test set are stored in predictions/test/.

  12. Run submission generation:
$ python src/submit.py

The ensemble weights are manually defined and the CSV submissions are generated for 5 runs for each subtask: short-term and long-term memorability scores.

  13. [Optional] Visualize heatmaps of class activation:
$ python src/viz_activations.py --model ResNet152

Our Results

  1. Validation Results on our Individual Models:

  2. Ensemble Results for the 5 runs each

  3. Ensemble Features for the 5 runs each

  4. Telegram chatbot to report results

Visualizations

  1. Short-term and long-term Memorability Histograms

  2. Exploring Top Captions

  3. Exploring Bottom Captions

Findings & Contributions

  • DL CNN models typically outperform models trained on captions and other visual features for short-term memorability; however, techniques such as embeddings and RNNs can achieve very strong results with captions

  • We believe fine-tuned CNN models will outperform pre-trained models as feature extractors given enough training samples (not proven in this paper)

  • Ensembling models by combining their predictions, instead of training models on very long concatenated feature vectors, is the alternative we used to counteract memory limitations (see the sketch after this list)

  • Ensembling models across different modalities, such as emotions, captions, high-level representations from CNNs and pre-computed visual features, achieves the best results because they capture different high-level abstractions
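
A hedged sketch of the prediction-level ensembling and the binned weight search mentioned above (the file names are placeholders; see src/ensemble_auto.py for the actual implementation):

import itertools
import numpy as np
from scipy.stats import spearmanr

# Validation predictions of a few individual models and the ground truth (placeholder files)
preds = {
    "captions": np.load("captions_val.npy"),
    "resnet152": np.load("resnet152_val.npy"),
    "c3d": np.load("c3d_val.npy"),
}
y_val = np.load("short_term_val.npy")

names = list(preds)
bins = np.arange(0.0, 1.01, 0.1)  # candidate weights in bins of 0.1
best_weights, best_rho = None, -1.0
for w in itertools.product(bins, repeat=len(names)):
    if not np.isclose(sum(w), 1.0):
        continue  # only consider weight combinations that sum to one
    blend = sum(wi * preds[n] for wi, n in zip(w, names))
    rho, _ = spearmanr(y_val, blend)
    if rho > best_rho:
        best_weights, best_rho = dict(zip(names, w)), rho
print(best_weights, round(best_rho, 3))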

Extra: Activation Maps for CNNs

A ResNet152 model trained on ImageNet was applied to one of the video frames (frame 48) of the top short-term and long-term most memorable videos. This is very useful for understanding which parts of these images led the pre-trained CNN to its ImageNet classification. The technique is called class activation map (CAM) visualization and consists of producing heatmaps of class activation over input images. For further details see François Chollet's Deep Learning with Python book.
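
A minimal Grad-CAM-style sketch of this visualization (not necessarily identical to src/viz_activations.py; the last-conv-layer name and the frame path are assumptions):

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import ResNet152
from tensorflow.keras.applications.resnet import preprocess_input

model = ResNet152(weights="imagenet")
last_conv = model.get_layer("conv5_block3_out")  # assumed name of the final conv block
grad_model = tf.keras.Model(model.inputs, [last_conv.output, model.output])

def class_activation_map(frame_path):
    img = tf.keras.preprocessing.image.load_img(frame_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(tf.keras.preprocessing.image.img_to_array(img), 0))
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(x)
        top_class = tf.argmax(preds[0])
        top_score = tf.gather(preds[0], top_class)
    # Weight each feature map by the mean gradient of the top class, then sum and rectify
    grads = tape.gradient(top_score, conv_out)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    cam = cam / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()  # low-resolution heatmap to resize and overlay on the frame

heatmap = class_activation_map("frame48.jpg")  # hypothetical frame path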

Top short-term most memorable videos

  1. video798.webm. The top-4 classes predicted for this video frame are as follows:
  • 'torch': 0.23151287 (with 23.15% probability)
  • 'hatchet': 0.094463184 (with 9.44% probability)
  • 'crutch': 0.0654099 (with 6.54% probability)
  • 'pedestal': 0.06340647 (with 6.34% probability)

  2. video1981.webm:
  • 'bow_tie': 0.99436283
  • 'torch': 0.0010983162
  • 'theater_curtain': 0.00067173946
  • 'feather_boa': 0.0004574099
  • 'groom': 0.00034087678

  3. video4903.webm:
  • 'television': 0.5428618
  • 'desktop_computer': 0.115691125
  • 'screen': 0.11060062
  • 'laptop': 0.06419162
  • 'monitor': 0.05998577
  • 'notebook': 0.040473375

  4. video9496.webm:
  • 'sandbar': 0.55648345
  • 'seashore': 0.13317421
  • 'lakeside': 0.03515112
  • 'wreck': 0.028257731
  • 'volcano': 0.017195351

  5. video6103.webm:
  • 'fur_coat': 0.66497004
  • 'cloak': 0.16292651
  • 'ski_mask': 0.024773473
  • 'lab_coat': 0.016840363

Top long-term most memorable videos

  1. video5186.webm:
  • 'mountain_bike': 0.8176742
  • 'bicycle-built-for-two': 0.1651485
  • 'unicycle': 0.009558631
  • 'alp': 0.0027272117

  2. video4798.webm:
  • 'jean': 0.64808583
  • 'cash_machine': 0.06661992
  • 'trench_coat': 0.026500706
  • 'wardrobe': 0.026173087
  • 'prison': 0.025266951

  3. video480.webm:
  • 'giant_schnauzer': 0.28221375
  • 'cocker_spaniel': 0.172711
  • 'Scotch_terrier': 0.11454323
  • 'Great_Dane': 0.045542818
  • 'Lakeland_terrier': 0.033769395
  • 'standard_schnauzer': 0.030899713

  4. video7606.webm:
  • 'chain_saw': 0.15715672
  • 'pole': 0.099422
  • 'hook': 0.064023055
  • 'paintbrush': 0.04958201
  • 'shovel': 0.031757597

  5. video4809.webm:
  • 'racket': 0.9964013
  • 'tennis_ball': 0.0032226138
  • 'ping-pong_ball': 0.00037128705

Learning Resources

MediaEval 2018

Regression: predicting a continuous variable

Embeddings for processing video captions

Custom generators and data augmentation

Training the final model

Miscellaneous