This repository holds the pretrained models for the Cross-Modal Deep Clustering (XDC) method, presented as a spotlight at NeurIPS 2020.
Self-Supervised Learning by Cross-Modal Audio-Video Clustering. Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran. In NeurIPS, 2020.
We provide the following pretrained R(2+1)D-18 video models. We report the average top-1 video-level accuracy over all splits on UCF101 and HMDB51 after full-finetuning.
| Pretraining Name | Description | UCF101 | HMDB51 | Weights |
|---|---|---|---|---|
| `r2plus1d_18_xdc_ig65m_kinetics` | XDC pretrained on IG-Kinetics | 95.5 | 68.9 | [PyTorch] [Caffe2] |
| `r2plus1d_18_xdc_ig65m_random` | XDC pretrained on IG-Random | 94.6 | 66.5 | [PyTorch] [Caffe2] |
| `r2plus1d_18_xdc_audioset` | XDC pretrained on AudioSet | 93.0 | 63.7 | [PyTorch] [Caffe2] |
| `r2plus1d_18_fs_kinetics` | fully-supervised pretraining on Kinetics | 94.2 | 65.1 | [PyTorch] [Caffe2] |
| `r2plus1d_18_fs_imagenet` | fully-supervised pretraining on ImageNet | 84.0 | 48.1 | [PyTorch] [Caffe2] |
There are two ways to load the XDC pretrained models in PyTorch: (1) via PyTorch Hub or (2) via source code.
You can load all our pretrained models using the `torch.hub.load()` API:
```python
import torch

model = torch.hub.load('HumamAlwassel/XDC', 'xdc_video_encoder',
                       pretraining='r2plus1d_18_xdc_ig65m_kinetics',
                       num_classes=42)
```
Use the parameter `pretraining` to specify which pretrained model from the table above to load (the default is `r2plus1d_18_xdc_ig65m_kinetics`). Pretrained weights are loaded for all layers except the final FC classifier layer. The FC layer (of size 512 × `num_classes`) is randomly initialized. Set the keyword argument `num_classes` based on your application (the default is 400).
Run `print(torch.hub.help('HumamAlwassel/XDC', 'xdc_video_encoder'))` for the model documentation. Learn more about PyTorch Hub here.
Clone this repo and create the conda environment.
```shell
git clone https://github.com/HumamAlwassel/XDC.git
cd XDC
conda env create -f environment.yml
conda activate xdc
```
Load the pretrained models from the file `xdc.py`:
```python
from xdc import xdc_video_encoder

model = xdc_video_encoder(pretraining='r2plus1d_18_xdc_ig65m_kinetics',
                          num_classes=42)
```
Please refer to the Facebook Video Model Zoo (VMZ) repo for PyTorch/Caffe2 scripts for feature extraction and model finetuning on datasets such as UCF101 and HMDB51.
Please cite this work if you find XDC useful for your research.
```bibtex
@inproceedings{alwassel_2020_xdc,
  title={Self-Supervised Learning by Cross-Modal Audio-Video Clustering},
  author={Alwassel, Humam and Mahajan, Dhruv and Korbar, Bruno and
          Torresani, Lorenzo and Ghanem, Bernard and Tran, Du},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2020}
}
```