Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R., Harwath, D., Glass, J. and Kuehne, H., 2021. Everything at Once – Multi-modal Fusion Transformer for Video Retrieval. arXiv preprint arXiv:2112.04446.
Accepted at CVPR 2022!
Repository contains:
- the code to conduct all experiments reported in the paper
- model weights to obtain main results
- data for fine-tuning and evaluation on the YouCook2 and MSR-VTT datasets
- Create an environment:
conda create python=3.6 -y -n everything_at_once
conda activate everything_at_once
conda install -y pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.2 -c pytorch
pip install gensim==3.8.0 sacred==0.8.2 humanize==3.14.0 transformers==4.10.2 librosa==0.8.1 timm==0.4.12
pip install neptune-contrib==0.28.1 --ignore-installed certifi
- If needed, download data.tar with features and spectrograms to fine-tune and evaluate on YouCook2 and MSR-VTT here. Extract the tar:
tar -xvf data.tar
- If needed, create the pretrained_models folder and download model weights here:
  - Everything-At_Once (ResNet-152,ResNeXt-101)
  - Everything-At_Once (ResNet-152,ResNeXt-101), fine-tuned on YouCook2
  - Everything-At_Once (ResNet-152,ResNeXt-101), fine-tuned on MSR-VTT
  - Everything-At_Once (S3D)
  - Everything-At_Once (CLIP)
  - Everything-At_Once (ResNet-152,ResNeXt-101), text-video only
Extract the tar:
cd pretrained_models
tar -xvf everything_at_once_tva.tar
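If you would rather unpack the archives from Python (for example inside a setup script), the following is a minimal sketch that uses only the standard library. The helper name `extract_tar` is hypothetical and not part of this repository; it also rejects archive members that would be written outside the destination folder.

```python
import os
import tarfile


def extract_tar(archive_path, dest_dir="."):
    """Extract archive_path into dest_dir, refusing members that escape dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path) as tar:
        # Check every member before writing anything to disk.
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if target != dest_root and not target.startswith(dest_root + os.sep):
                raise ValueError("unsafe path in archive: %s" % member.name)
        tar.extractall(dest_root)
    return dest_root
```

With the file names used above, `extract_tar("data.tar")` and `extract_tar("pretrained_models/everything_at_once_tva.tar", "pretrained_models")` should be equivalent to the two `tar -xvf` commands.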
To evaluate a pretrained everything-at-once model on the MSR-VTT dataset, run:
python test.py --n_gpu 1 \
--config configs/evaluation/msrvtt_at_once.yaml \
--resume pretrained_models/everything_at_once_tva/latest_model.pth
On the YouCook2 dataset:
python test.py --n_gpu 1 \
--config configs/evaluation/youcook_at_once.yaml \
--resume pretrained_models/everything_at_once_tva/latest_model.pth
Check out the configs/evaluation folder to find more configs for evaluating models trained with S3D or CLIP features, or for using other strategies to process long videos.
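To run several evaluations in a row, the commands above can be assembled in a small script. This is only a sketch: `build_eval_command` and `run_evaluations` are hypothetical helpers, not part of the repository, and they assume you pair each config with the checkpoint that matches its features (for example, S3D or CLIP configs with the corresponding weights).

```python
import subprocess


def build_eval_command(config, checkpoint, n_gpu=1):
    """Build the test.py command line shown above as an argument list."""
    return [
        "python", "test.py",
        "--n_gpu", str(n_gpu),
        "--config", config,
        "--resume", checkpoint,
    ]


def run_evaluations(runs, dry_run=True):
    """runs: iterable of (config, checkpoint) pairs.

    With dry_run=True the commands are only printed, not executed.
    """
    commands = [build_eval_command(config, ckpt) for config, ckpt in runs]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # Stop at the first failing evaluation.
            subprocess.run(cmd, check=True)
    return commands
```

For instance, passing the pairs `("configs/evaluation/msrvtt_at_once.yaml", "pretrained_models/everything_at_once_tva/latest_model.pth")` and `("configs/evaluation/youcook_at_once.yaml", "pretrained_models/everything_at_once_tva/latest_model.pth")` with `dry_run=False` would run both evaluations from this section.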
To fine-tune the HowTo100M-pretrained model on the MSR-VTT dataset, run:
python train.py \
--config configs/finetuning/finetune_msrvtt.yaml \
--resume pretrained_models/everything_at_once_tva/latest_model.pth
Add the --neptune flag if you want to log experiments using neptune.ai (see Experiment Logging).
On the YouCook2 dataset:
python train.py \
--config configs/finetuning/finetune_youcook.yaml \
--resume pretrained_models/everything_at_once_tva/latest_model.pth
Add the --neptune flag if you want to log experiments using neptune.ai (see Experiment Logging).
Check out the configs/finetuning/clip folder to find configs for fine-tuning with CLIP features.
- Downloading HowTo100M and feature extraction. Please note that HowTo100M videos require a huge amount of storage, and the features alone take up terabytes of space. Feature extraction (ResNet-152, ResNeXt-101) and audio spectrogram extraction are described in detail in https://github.com/roudimit/AVLnet/blob/main/training.md. We will release the code for S3D and CLIP feature extraction.
- Review configs/pretraining/everything_at_once_tva.yaml and make sure csv, features_path, features_path_audio, and caption_path point to the correct paths.
- Train:
python train.py --config configs/pretraining/everything_at_once_tva.yaml
Add the --neptune flag if you want to log experiments using neptune.ai (see Experiment Logging).
Check out the configs/pretraining folder to find more configs for different ablation experiments.
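Because pretraining runs are long, it can be worth confirming that the four paths from the config (csv, features_path, features_path_audio, caption_path) exist before launching. The sketch below is an assumption-laden convenience, not part of the repository: `find_missing_paths` is a hypothetical helper, and you pass in the values by hand because the exact layout of the YAML file is not described here.

```python
import os


def find_missing_paths(paths):
    """Return the {key: path} entries whose path does not exist on disk."""
    return {key: path for key, path in paths.items() if not os.path.exists(path)}


# Example usage (placeholder values; copy the real ones from your config):
# missing = find_missing_paths({
#     "csv": "/path/to/csv",
#     "features_path": "/path/to/features",
#     "features_path_audio": "/path/to/audio_features",
#     "caption_path": "/path/to/captions",
# })
# for key, path in missing.items():
#     print("%s does not exist: %s" % (key, path))
```

An empty result means every listed path was found.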
This repository uses Sacred with neptune.ai for logging and tracking experiments. To enable this:
- Create a neptune.ai account.
- Create a project and copy your credentials (api_token, project_name) into train.py.
- Add the --neptune flag to the training command (e.g. python train.py --neptune ..).
If you want to use the model on your own data, please follow the steps described in https://github.com/roudimit/AVLnet for feature extraction and audio spectrogram extraction.
You may also take a look at everything_at_once_tva.yaml, which contains some comments on how to define n_video_tokens and num_audio_STFT_frames.
If you use this code in your research, please cite:
@article{shvetsova2021everything,
title={Everything at Once--Multi-modal Fusion Transformer for Video Retrieval},
author={Shvetsova, Nina and Chen, Brian and Rouditchenko, Andrew and Thomas, Samuel and Kingsbury, Brian and Feris, Rogerio and Harwath, David and Glass, James and Kuehne, Hilde},
journal={arXiv preprint arXiv:2112.04446},
year={2021}
}
If you have any problems with the code or have a question, please open an issue or send an email to shvetsova at em.uni-frankfurt.de. I'll try to answer as soon as possible.
The main structure of the code is based on the frozen-in-time code: https://github.com/m-bain/frozen-in-time, which itself is based on the pytorch-template https://github.com/victoresque/pytorch-template. Thanks for sharing good practices!
The code in davenet.py, layers.py, and avlnet.py is partly derived from https://github.com/dharwath/DAVEnet-pytorch/, https://github.com/wnhsu/ResDAVEnet-VQ, https://github.com/antoine77340/howto100m, and https://github.com/roudimit/AVLnet, and is licensed under BSD-3 (David Harwath, Wei-Ning Hsu, Andrew Rouditchenko) and Apache License 2.0 (Antoine Miech).