
Self-Supervised Learning by Cross-Modal Audio-Video Clustering

[Paper] [Project Website]

This repository hosts the pretrained models for the Cross-Modal Deep Clustering (XDC) method, presented as a spotlight at NeurIPS 2020.

Self-Supervised Learning by Cross-Modal Audio-Video Clustering. Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran. In NeurIPS, 2020.

Load Pretrained Models

We provide the following pretrained R(2+1)D-18 video models. We report the average top-1 video-level accuracy over all splits on UCF101 and HMDB51 after full finetuning.

| Pretraining Name | Description | UCF101 | HMDB51 | Weights |
|---|---|---|---|---|
| r2plus1d_18_xdc_ig65m_kinetics | XDC pretrained on IG-Kinetics | 95.5 | 68.9 | [PyTorch] [Caffe2] |
| r2plus1d_18_xdc_ig65m_random | XDC pretrained on IG-Random | 94.6 | 66.5 | [PyTorch] [Caffe2] |
| r2plus1d_18_xdc_audioset | XDC pretrained on AudioSet | 93.0 | 63.7 | [PyTorch] [Caffe2] |
| r2plus1d_18_fs_kinetics | fully-supervised pretraining on Kinetics | 94.2 | 65.1 | [PyTorch] [Caffe2] |
| r2plus1d_18_fs_imagenet | fully-supervised pretraining on ImageNet | 84.0 | 48.1 | [PyTorch] [Caffe2] |

There are two ways to load the XDC pretrained models in PyTorch: (1) via PyTorch Hub or (2) via source code.

Via PyTorch Hub (Recommended)

You can load all our pretrained models using the torch.hub.load() API.

import torch

model = torch.hub.load('HumamAlwassel/XDC', 'xdc_video_encoder', 
                        pretraining='r2plus1d_18_xdc_ig65m_kinetics',
                        num_classes=42)

Use the pretraining parameter to specify which pretrained model from the table above to load (the default is r2plus1d_18_xdc_ig65m_kinetics). Pretrained weights are loaded for all layers except the final FC classifier layer, which has size 512 x num_classes and is randomly initialized; set the keyword argument num_classes to match your application (the default is 400). Run print(torch.hub.help('HumamAlwassel/XDC', 'xdc_video_encoder')) for the model documentation. Learn more about PyTorch Hub here.

Via Source Code

Clone this repo and create the conda environment.

git clone https://github.com/HumamAlwassel/XDC.git
cd XDC
conda env create -f environment.yml
conda activate xdc

Load the pretrained models from the file xdc.py.

from xdc import xdc_video_encoder

model = xdc_video_encoder(pretraining='r2plus1d_18_xdc_ig65m_kinetics',
                          num_classes=42)

Feature Extraction and Model Finetuning

Please refer to the Facebook Video Model Zoo (VMZ) repo for PyTorch/Caffe2 scripts for feature extraction and model finetuning on datasets such as UCF101 and HMDB51.

Citation

Please cite this work if you find XDC useful for your research.

@inproceedings{alwassel_2020_xdc,
  title={Self-Supervised Learning by Cross-Modal Audio-Video Clustering},
  author={Alwassel, Humam and Mahajan, Dhruv and Korbar, Bruno and 
          Torresani, Lorenzo and Ghanem, Bernard and Tran, Du},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2020}
}