Data-efficient-Alignment

This repository is an implementation of the WACV 2021 paper 'Data-efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions'. arXiv preprint

Contents

Prerequisites

  • python == 3.6
  • pytorch == 1.4.0
  • tensorboard

Alternatively, create the environment with conda env create -f alignment.yml.

Dataset & Feature Extraction

Dataset

The YMS_dataset directory provides information about the YouTube Movie Summaries (YMS) dataset.

Feature Extraction

For each video clip, we extract features of the central frame using a Faster R-CNN trained on the Visual Genome dataset. The VG Detector from the repo @nocaps-org/image-feature-extractors is helpful, and the model can be found at @peteanderson80/bottom-up-attention.

For each text snippet, we extract a 768-dimensional sentence embedding from the BERT-Base model.
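How the per-token BERT states are pooled into a single 768-dimensional vector is not detailed here; a minimal mean-pooling sketch is shown below (the pooling choice and function name are assumptions, not necessarily what this repository does):

```python
import torch


def pool_sentence_embeddings(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool BERT token states into one sentence embedding per example.

    hidden: (batch, seq_len, 768) last-layer hidden states from BERT-Base.
    mask:   (batch, seq_len) attention mask, 1 for real tokens, 0 for padding.
    Returns: (batch, 768) sentence embeddings.
    """
    mask = mask.unsqueeze(-1).float()            # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)          # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1.0)      # avoid division by zero
    return summed / counts
```

Taking the [CLS] vector at position 0 is an equally common alternative to mean pooling.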

Highlights

  • SBN - an implementation of Sequence-wise Batch Normalization (multi-GPU training is not supported yet).
  • LARS - a function for adding LARS to an Adam optimizer.
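Sequence-wise Batch Normalization computes its statistics over both the batch and the sequence dimensions rather than the batch dimension alone. A minimal single-GPU sketch, assuming inputs of shape (batch, seq_len, dim); the module name and momentum handling are illustrative, not the repository's exact implementation:

```python
import torch
import torch.nn as nn


class SequenceBatchNorm(nn.Module):
    """Sketch: normalize features over the batch AND sequence dimensions."""

    def __init__(self, dim: int, eps: float = 1e-5, momentum: float = 0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self.register_buffer("running_mean", torch.zeros(dim))
        self.register_buffer("running_var", torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        if self.training:
            mean = x.mean(dim=(0, 1))                 # stats over batch + sequence
            var = x.var(dim=(0, 1), unbiased=False)
            with torch.no_grad():
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * var)
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return x_hat * self.weight + self.bias
```

Because the statistics are computed over the whole mini-batch, a naive per-device version would diverge across GPUs, which is consistent with the multi-GPU limitation noted above.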
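LARS rescales each layer's update by a trust ratio proportional to ||w|| / ||g||. One way to combine it with Adam is to rescale gradients in place between loss.backward() and optimizer.step(); the sketch below takes that approach (the function name and exact placement are assumptions, and the repository's --adamlars implementation may differ):

```python
import torch


def apply_lars(model: torch.nn.Module, lars_coef: float = 1e-3, eps: float = 1e-9) -> None:
    """Scale each parameter's gradient by its layer-wise trust ratio.

    trust = lars_coef * ||w|| / ||grad(w)||, computed per parameter tensor.
    Call after loss.backward() and before optimizer.step().
    """
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            w_norm = p.norm()
            g_norm = p.grad.norm()
            if w_norm > 0 and g_norm > 0:
                p.grad.mul_(lars_coef * w_norm / (g_norm + eps))
```

A typical training step would then be: loss.backward(); apply_lars(model, lars_coef=1e-3); optimizer.step(), with the coefficient matching the --lars_coef flag used below.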

Quick Start

Downloading

For convenience, we provide the processed data and a pretrained model (RP+LARS+SBN) on Google Drive. Please put them under this directory.

Evaluation Example

After downloading the data and the model, run evaluation on the test set:

CUDA_VISIBLE_DEVICES=0 python do.py \
--evaluate \
--SBN \
--random_project \
--dataset yms \
--where_best './data/RP_SBN_LARS.ckpt' 

Training Example

An example of training:

CUDA_VISIBLE_DEVICES=0 python do.py \
--lr 7 \
--loss ls --lsr_epsilon 0.03 \
--adamlars --lars_coef 1e-3 \
--SBN \
--random_project \
--dataset yms \
--epochs 350