This repository holds the pretrained models for the Cross-Modal Deep Clustering (XDC) method, presented as a spotlight at NeurIPS 2020.
Self-Supervised Learning by Cross-Modal Audio-Video Clustering. Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran. In NeurIPS, 2020.
We provide the following pretrained R(2+1)D-18 video models. We report the average top-1 video-level accuracy over all splits on UCF101 and HMDB51 after full-finetuning.
| Pretraining Name | Description | UCF101 | HMDB51 | Weights |
|---|---|---|---|---|
| `r2plus1d_18_xdc_ig65m_kinetics` | XDC pretrained on IG-Kinetics | 95.5 | 68.9 | [PyTorch] [Caffe2] |
| `r2plus1d_18_xdc_ig65m_random` | XDC pretrained on IG-Random | 94.6 | 66.5 | [PyTorch] [Caffe2] |
| `r2plus1d_18_xdc_audioset` | XDC pretrained on AudioSet | 93.0 | 63.7 | [PyTorch] [Caffe2] |
| `r2plus1d_18_fs_kinetics` | fully-supervised pretraining on Kinetics | 94.2 | 65.1 | [PyTorch] [Caffe2] |
| `r2plus1d_18_fs_imagenet` | fully-supervised pretraining on ImageNet | 84.0 | 48.1 | [PyTorch] [Caffe2] |
There are two ways to load the XDC pretrained models in PyTorch: (1) via PyTorch Hub or (2) via source code.
You can load all our pretrained models using the `torch.hub.load()` API:
```python
import torch

model = torch.hub.load('HumamAlwassel/XDC', 'xdc_video_encoder',
                       pretraining='r2plus1d_18_xdc_ig65m_kinetics',
                       num_classes=42)
```
Use the parameter `pretraining` to specify which pretrained model from the table above to load (the default is `r2plus1d_18_xdc_ig65m_kinetics`). Pretrained weights are loaded for all layers except the final FC classifier layer. The FC layer (of size 512 × `num_classes`) is randomly initialized. Set the keyword argument `num_classes` based on your application (the default is 400).
Run `print(torch.hub.help('HumamAlwassel/XDC', 'xdc_video_encoder'))` for the model documentation. Learn more about PyTorch Hub here.
Clone this repo and create the conda environment.
```shell
git clone https://github.com/HumamAlwassel/XDC.git
cd XDC
conda env create -f environment.yml
conda activate xdc
```
Load the pretrained models from the file `xdc.py`:
```python
from xdc import xdc_video_encoder

model = xdc_video_encoder(pretraining='r2plus1d_18_xdc_ig65m_kinetics',
                          num_classes=42)
```
Please refer to the Facebook Video Model Zoo (VMZ) repo for PyTorch/Caffe2 scripts for feature extraction and model finetuning on datasets such as UCF101 and HMDB51.
Please cite this work if you find XDC useful for your research.
```bibtex
@inproceedings{alwassel_2020_xdc,
  title={Self-Supervised Learning by Cross-Modal Audio-Video Clustering},
  author={Alwassel, Humam and Mahajan, Dhruv and Korbar, Bruno and
          Torresani, Lorenzo and Ghanem, Bernard and Tran, Du},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2020}
}
```