This repository contains the official implementation to reproduce the results of our paper VASTA: Diverse Video Captioning by Adaptive Spatio-temporal Attention, accepted at GCPR 2022 (link).
To generate proper captions for videos, a captioning model needs to identify relevant concepts and pay attention to the spatial relationships between them as well as to the temporal development in the clip. Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures: an adapted transformer for joint spatio-temporal video analysis and a self-attention-based decoder for advanced text generation. Furthermore, we introduce an adaptive frame selection scheme to reduce the number of required input frames while maintaining the relevant content when training both transformers. Additionally, we estimate semantic concepts relevant for video captioning by aggregating all ground-truth captions of each sample. Our approach achieves state-of-the-art results on MSVD as well as on the large-scale MSR-VTT and VATEX benchmark datasets across multiple Natural Language Generation (NLG) metrics. Additional evaluations on diversity scores highlight the expressiveness and diversity in the structure of our generated captions.
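As an illustration of the semantic-concept estimation mentioned above, the following minimal Python sketch shows one simple way per-video tags could be aggregated from all ground-truth captions of a sample; the toy vocabulary and the presence-based aggregation rule are assumptions for illustration only, the actual procedure is the one implemented in this repository.

```python
from collections import Counter

# Illustrative sketch only (not the exact procedure of this repository):
# aggregate all ground-truth captions of one video and mark which entries of a
# fixed tag vocabulary occur in them, yielding a multi-label semantics target.
def aggregate_tags(captions, tag_vocab):
    counts = Counter(w for c in captions for w in c.lower().split())
    return [int(counts[t] > 0) for t in tag_vocab]

captions = ["a man is playing a guitar", "someone plays the guitar"]
tag_vocab = ["man", "guitar", "dog"]          # assumed toy vocabulary
print(aggregate_tags(captions, tag_vocab))    # -> [1, 1, 0]
```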
It contains the following sections:
- Data download
- Requirements and Setup
- Training the models
- Pre-trained checkpoints
- Evaluating the trained models
To start, clone this repository and cd into its root directory.
git clone https://github.com/zohrehghaderi/VASTA.git
cd VASTA
We show results on two datasets, MSVD and MSR-VTT. The output of our adaptive frame selection method is provided in data/dataset_name/index_32, the related labels for the semantics network are in data/dataset_name/tag, and the normalized captions are stored in data/dataset_name/file.pickle. To use this code, you need to download the videos of both datasets and place them in data/dataset_name/videos. For example, the MSVD dataset follows this tree:
data
|--MSVD
   |--index_32          \\ output of adaptive frame selection
   |--tag               \\ extracted tags for the semantics network
   |--videos            \\ videos
   |-MSVD_vocab.pkl     \\ word dictionary
   |-full_test.pickle   \\ to evaluate NLP metrics on test data
   |-full_val.pickle    \\ to evaluate NLP metrics on validation data
   |-tag.npy            \\ tag dictionary
   |-test_data.pickle   \\ test video names and related captions
   |-train_data.pickle  \\ train video names and related captions
   |-val_data.pickle    \\ val video names and related captions
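As a quick sanity check that the annotation files are in place, a minimal sketch like the following can be used; the exact structure of the pickled objects is defined by this repository, so the variable names below are only illustrative.

```python
import pickle
import numpy as np

# Illustrative sanity check of the provided MSVD annotation files; the exact
# content and structure of these files is defined by this repository.
with open("data/MSVD/MSVD_vocab.pkl", "rb") as f:
    vocab = pickle.load(f)              # word dictionary
with open("data/MSVD/train_data.pickle", "rb") as f:
    train_data = pickle.load(f)         # train video names and related captions
tags = np.load("data/MSVD/tag.npy", allow_pickle=True)  # tag dictionary

print(type(vocab), type(train_data), tags.shape)
```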
To download MSVD, follow this link
To download MSR-VTT, follow this link
To run our code, create a conda environment with this command:
conda env update -f environment.yml -n TED
conda activate TED
This will install all dependencies described in our environment.yml
file.
To download the weights of the Swin-B network, refer to Link and then put them in checkpoint/swin.
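If you want to verify that the downloaded weights are readable, a hedged sketch like the one below can help; the exact file name inside checkpoint/swin depends on the download link, so it is globbed rather than hard-coded.

```python
import glob
import torch

# Illustrative check only: load whichever Swin-B checkpoint was placed in
# checkpoint/swin; the exact file name depends on the downloaded weights.
ckpt_path = sorted(glob.glob("checkpoint/swin/*.pth"))[0]
state = torch.load(ckpt_path, map_location="cpu")
print(ckpt_path, type(state), len(state))
```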
In this repository, pycocoevalcap is employed in the nlp_metrics folder to evaluate the validation and test data.
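For reference, pycocoevalcap scorers operate on dictionaries mapping an id to a list of caption strings; a minimal, self-contained sketch (independent of our nlp_metrics wrapper, whose interface may differ) looks like this:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# ids -> list of (already tokenized/normalized) caption strings
gts = {"vid1": ["a man is playing a guitar", "someone plays the guitar"]}
res = {"vid1": ["a man plays a guitar"]}

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE_L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)  # Bleu returns a list of BLEU-1..4 scores
```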
We show several models in our paper (AFS-Swin-Bert-semantics, UFS-Swin-Bert-semantics, AFS-Swin-Bert, UFS-Swin-Bert), controlled by setting --afs and --semantics to True or False. The latter two are ablations.
DATASET_NAME is msvd or msrvtt.
To train our best TED-VC model, use this command:
python main.py --afs=True --dataset=DATASET_NAME --semantics=True --ckp_semantics=checkpoint/semantics_net/DATASET_NAME/semantics.ckpt
To train our best TED-VC model without Adaptive Frame Selection (AFS), use this command:
python main.py --afs=False --dataset=DATASET_NAME --semantics=True --ckp_semantics=checkpoint/semantics_net/DATASET_NAME/semantics.ckpt
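For example, for the MSVD dataset (same pattern as above, with DATASET_NAME substituted):
python main.py --afs=True --dataset=msvd --semantics=True --ckp_semantics=checkpoint/semantics_net/msvd/semantics.ckpt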
Additionally, you can find pre-trained checkpoints of our models here:
Model Name | Dataset | Link |
---|---|---|
AFS-Swin-Bert-semantics | MSVD | link |
UFS-Swin-Bert-semantics | MSVD | link |
AFS-Swin-Bert-semantics | MSRVTT | link |
UFS-Swin-Bert-semantics | MSRVTT | link |
To evaluate our best TED-VC model, use this command:
python test.py --afs=True --dataset=DATASET_NAME --semantics=True --bestmodel=LINK_BESTMODEL
For example, for the MSVD dataset:
python test.py --afs=True --dataset=msvd --semantics=True --bestmodel=bestmodel/msvd/AFSSemantics.ckpt
To evaluate our best TED-VC model without Adaptive Frame Selection (AFS), use this command:
python test.py --afs=False --dataset=DATASET_NAME --semantics=True --bestmodel=LINK_BESTMODEL
This should produce the following results:
Model Name | Dataset | BLEU-4 | METEOR | CIDEr | ROUGE-L |
---|---|---|---|---|---|
AFS-Swin-Bert-semantics | MSVD | 56.14 | 39.09 | 106.3 | 74.47 |
UFS-Swin-Bert-semantics | MSVD | 54.30 | 38.18 | 102.7 | 74.28 |
AFS-Swin-Bert-semantics | MSRVTT | 43.43 | 30.24 | 55.00 | 62.54 |
UFS-Swin-Bert-semantics | MSRVTT | 43.51 | 29.75 | 53.59 | 62.27 |
To evaluate our model without Adaptive Frame Selection (AFS) and without the semantics network, use this command:
python test.py --afs=False --dataset=DATASET_NAME --semantics=False --bestmodel=LINK_BESTMODEL
@inproceedings{ghaderi2022diverse,
title={Diverse Video Captioning by Adaptive Spatio-temporal Attention},
author={Ghaderi, Zohreh and Salewski, Leonard and Lensch, Hendrik PA},
booktitle={DAGM German Conference on Pattern Recognition},
pages={409--425},
year={2022},
organization={Springer}
}
This readme is inspired by https://github.com/paperswithcode/releasing-research-code/blob/master/templates/README.md.