PyTorch implementation of Distilling Audio-Visual Knowledge by Compositional Contrastive Learning.
Distilling knowledge from pre-trained teacher models helps to learn a small student model that generalizes better. While existing works mostly focus on distilling knowledge within the same modality, we explore distilling the multi-modal knowledge available in video data (i.e. audio and vision). Specifically, we propose to transfer audio and visual knowledge from pre-trained image and audio teacher models to learn more expressive video representations.
In multi-modal distillation, there often exists a semantic gap across modalities, e.g. a video shows someone applying lipstick while its accompanying audio is music. To ensure effective multi-modal distillation in the presence of a cross-modal semantic gap, we propose compositional contrastive learning, which features learnable compositional embeddings to close the cross-modal semantic gap, and a multi-class contrastive distillation objective to align different modalities jointly in the shared latent space.
We demonstrate our method can distill knowledge from the audio and visual modalities to learn a stronger video model for recognition and retrieval tasks on video action recognition datasets.
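To make the idea of contrastive alignment between a student and a teacher more concrete, the snippet below is a minimal sketch of a generic instance-level InfoNCE-style loss. It is not the multi-class contrastive distillation objective of the paper and it omits the compositional embeddings entirely; the function names, the temperature value, and the use of plain Python lists instead of PyTorch tensors are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def contrastive_distillation_loss(student, teacher, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative, not the paper's objective).

    student[i] and teacher[i] are embeddings of the same video; every other
    teacher embedding in the batch acts as a negative for student[i].
    """
    student = [l2_normalize(s) for s in student]
    teacher = [l2_normalize(t) for t in teacher]
    total = 0.0
    for i, s in enumerate(student):
        logits = [dot(s, t) / temperature for t in teacher]
        # Numerically stable log-sum-exp over all teacher embeddings.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching teacher embedding as the target.
        total += log_sum_exp - logits[i]
    return total / len(student)
```

The loss is small when each student embedding is closest to the teacher embedding of the same sample, and grows when the pairs are mismatched.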
- python >= 3.6.10
- pytorch == 1.1.0
- FFmpeg, FFprobe
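Before preparing the data, it may help to confirm that the requirements above are available. The following is a minimal sketch using only the Python standard library; the helper name check_environment is hypothetical and not part of this repository, and it only checks that a torch package is importable, not that its version is 1.1.0.

```python
import importlib.util
import shutil
import sys


def check_environment(min_python=(3, 6, 10), tools=("ffmpeg", "ffprobe")):
    """Return a dict mapping each requirement to True/False (hypothetical helper)."""
    report = {"python": sys.version_info[:3] >= tuple(min_python)}
    # Only checks that a 'torch' package can be found, not its exact version.
    report["torch"] = importlib.util.find_spec("torch") is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```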
- Download datasets: UCF101, ActivityNet, VGGSound
- Audio features are extracted with the pre-trained audio model PANNs. The UCF101 audio features are provided under the directory dataset/UCF101. Please uncompress the audiocnn14embed512_features.tar.gz file for details.
- Video data is converted to the hdf5 format using the following command. Please specify the data directory ${UCF101_DATA_DIR}, e.g. datasets/UCF101/UCF-101. Note: video data can be downloaded here.
python util_scripts/generate_video_hdf5.py --dir_path=${UCF101_DATA_DIR} --dst_path=datasets/UCF101/hdf5data --dataset=ucf101
- Prepare the json file for the dataloader using the following command. Note: official data splits can be downloaded here.
python util_scripts/ucf101_json.py --dir_path=datasets/UCF101/ucfTrainTestlist --video_path=datasets/UCF101/hdf5data --audio_path=datasets/UCF101/audiocnn14embed512_features --dst_path=datasets/UCF101/ --video_type=hdf5
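The audio feature archive mentioned above has to be uncompressed before the json file is generated. As one possible way to do this from Python, here is a minimal sketch using the standard tarfile module. The helper name and the example paths in the comment are assumptions; the internal layout of the archive is not described here, so check the extracted directory against the --audio_path used in the command above.

```python
import tarfile
from pathlib import Path


def extract_audio_features(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and list its top-level entries.

    Hypothetical helper; only use it on archives from a trusted source.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(Path(archive_path), "r:gz") as tar:
        tar.extractall(dest_dir)
    return sorted(p.name for p in dest_dir.iterdir())


# Example call (paths are assumptions, adjust them to your checkout):
# extract_audio_features(
#     "dataset/UCF101/audiocnn14embed512_features.tar.gz",
#     "datasets/UCF101",
# )
```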
The running commands for both training and testing are written in the same script file. Experiments are conducted on 2 GPUs. Please refer to the script files in the directory scripts for details. Use the following commands to test on the UCF51 dataset.
- baseline (w/o distillation)
sh scripts/run_baseline.sh
- CCL (A): distilling audio knowledge from the pre-trained audio teacher model (audiocnn14)
sh scripts/run_ccl_audio.sh
- CCL (I): distilling image knowledge from the pre-trained image teacher model (resnet34)
sh scripts/run_ccl_image.sh
- CCL (AI): distilling audio and image knowledge from the pre-trained audio and image teacher models
sh scripts/run_ccl_ai.sh
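To run several of the variants above in sequence, a small driver can wrap the same sh commands. The sketch below is an illustration only: the dictionary keys, function names, and the dry_run flag are hypothetical and not part of this repository, while the script paths are the ones listed above.

```python
import subprocess

# Script paths as listed above; the short keys are made up for this sketch.
EXPERIMENT_SCRIPTS = {
    "baseline": "scripts/run_baseline.sh",
    "ccl_audio": "scripts/run_ccl_audio.sh",
    "ccl_image": "scripts/run_ccl_image.sh",
    "ccl_ai": "scripts/run_ccl_ai.sh",
}


def build_commands(names=None):
    """Return one ['sh', script] command per requested experiment."""
    names = list(EXPERIMENT_SCRIPTS) if names is None else list(names)
    return [["sh", EXPERIMENT_SCRIPTS[name]] for name in names]


def run_experiments(names=None, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    commands = build_commands(names)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence if a script exits with an error.
            subprocess.run(cmd, check=True)
    return commands
```

Calling run_experiments(["baseline", "ccl_ai"], dry_run=False) from the repository root would run those two scripts one after the other.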
@inproceedings{chen2021distilling,
title={Distilling Audio-Visual Knowledge by Compositional Contrastive Learning},
author={Chen, Yanbei and Xian, Yongqin and Koepke, Sophia and Shan, Ying and Akata, Zeynep},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
year={2021},
organization={IEEE}
}
This repository is partially built on two open-source implementations: (1) 3D-ResNets-PyTorch is used for video data preparation; (2) PANNs is used for audio feature extraction.