Code for the IJCAI'22 paper "GL-RG: Global-Local Representation Granularity for Video Captioning".
GL-RG exploits extensive vision representations from different video ranges to improve linguistic expression. We devise a novel global-local encoder to produce a rich semantic vocabulary. With our incremental training strategy, GL-RG successfully leverages the global-local vision representation to achieve fine-grained captioning of video content.
- Python 2.7
- PyTorch 0.2 or 1.0
- Microsoft COCO Caption Evaluation
- CIDEr
- numpy, scikit-image, h5py, requests
This repo was tested with Python 2.7, PyTorch 1.0.1 (or 0.2.0), cuDNN 10.0 (or 6.0), and CUDA 8.0, but it should also run with more recent PyTorch versions >= 1.0 (or >= 0.2 and <= 1.0).
You can use Anaconda or Miniconda to install the dependencies:
conda create -n GL-RG-pytorch python=2.7 pytorch=1.0 scikit-image h5py requests
conda activate GL-RG-pytorch
Or you can install the dependencies from the provided environment file:
conda env create -f environment.yaml
conda activate GL-RG-pytorch
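A quick sanity check like the one below (not part of the repository's scripts) can confirm that the dependencies are importable inside the new environment:

```bash
# Optional sanity check: make sure the core dependencies listed above import
# correctly inside the GL-RG-pytorch environment.
conda activate GL-RG-pytorch
python -c "import torch, h5py, numpy, skimage, requests; print('PyTorch', torch.__version__)"
```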
First, clone this repository to any location using `--recursive`:
git clone --recursive https://github.com/ylqi/GL-RG.git
Check out the `coco-caption/`, `cider/`, `data/` and `model/` projects into your working directory. If they are missing, please follow the detailed steps in INSTALL.md for installation and dataset preparation.
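A quick way to confirm the layout (a small helper sketch, not part of the repository) is to check that the four directories exist before continuing:

```bash
# Check that the working directory contains the projects described above.
for d in coco-caption cider data model; do
  [ -d "$d" ] && echo "found:   $d" || echo "missing: $d (see INSTALL.md)"
done
```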
Then run the following script to download the Stanford CoreNLP 3.6.0 models into `coco-caption/`:
cd coco-caption
./get_stanford_models.sh
Model | Dataset | Exp. | BLEU-4 | METEOR | ROUGE-L | CIDEr | Links |
---|---|---|---|---|---|---|---|
GL-RG | MSR-VTT | XE | 45.5 | 30.1 | 62.6 | 51.2 | GL-RG_XE_msrvtt |
GL-RG | MSR-VTT | DXE | 46.9 | 30.4 | 63.9 | 55.0 | GL-RG_DXE_msrvtt |
GL-RG + IT | MSR-VTT | DR | 46.9 | 31.2 | 65.7 | 60.6 | GL-RG_DR_msrvtt |
GL-RG | MSVD | XE | 55.5 | 37.8 | 74.7 | 94.3 | GL-RG_XE_msvd |
GL-RG | MSVD | DXE | 57.7 | 38.6 | 74.9 | 95.9 | GL-RG_DXE_msvd |
GL-RG + IT | MSVD | DR | 60.5 | 38.9 | 76.4 | 101.0 | GL-RG_DR_msvd |
Check out the trained model weights under the `model/` directory (following Installation) and run:
./test.sh
Note: Please modify `MODEL_NAME`, `EXP_NAME` and `DATASET` in `test.sh` if the experiment setting changes. For more details, please refer to TEST.md.
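For example, to evaluate the DR checkpoint on MSR-VTT from the table above, the three variables would be set along these lines (how `test.sh` declares them is an assumption here; the values follow the model table):

```bash
# Hypothetical excerpt from test.sh -- adjust to the checkpoint you downloaded.
MODEL_NAME=GL-RG_DR_msrvtt   # name of the trained weights under model/
EXP_NAME=DR                  # experiment setting: XE, DXE, or DR
DATASET=msrvtt               # 'msrvtt' or 'msvd'
```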
For the **Seeding Phase** (e.g., using XE):
./train.sh 1 # | 0 - using XE | 1 - using DXE |
For the **Boosting Phase** (e.g., using DR with b1):
./train.sh 3 # | 2 - with SCST baseline | 3 - with b1 baseline | 4 - with b2 baseline |
Note: For higher performance, please increase the batch size using `--batch_size` in `train.sh`. For more variants, set `--start_from` in `train.sh` to determine the Incremental Training entrance model, and set `--use_long_range`, `--use_short_range` and `--use_local` to enable the different global-local features:
- `--use_long_range`: enable long-range features.
- `--use_short_range`: enable short-range features.
- `--use_local`: enable local-keyframe features.
Modify `DATASET` (choices: 'msrvtt', 'msvd') in `train.sh` when switching to the MSR-VTT or MSVD benchmark.
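Putting the options together, a Boosting Phase configuration on MSR-VTT might look like the sketch below; the Python entry point name, the batch size and the `--start_from` path are illustrative assumptions, while the flag names come from the notes above:

```bash
# Hypothetical excerpt from train.sh: Boosting Phase on MSR-VTT.
# --start_from selects the Incremental Training entrance model; the three
# --use_* flags enable long-range, short-range and local-keyframe features.
DATASET=msrvtt                      # 'msrvtt' or 'msvd', read elsewhere in train.sh
python train.py --batch_size 64 \
    --start_from model/GL-RG_DXE_msrvtt \
    --use_long_range --use_short_range --use_local
```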
GL-RG is released under the MIT license.
We are truly thankful for the following prior efforts, both for their knowledge contributions and their open-source repositories.
- SA-LSTM: Describing Videos by Exploiting Temporal Structure (ICCV'15) [paper] [implementation code]
- SCST: Self-critical Sequence Training for Image Captioning (CVPR'17) [paper] [implementation code]
- RecNet: Reconstruction Network for Video Captioning (CVPR'18) [paper] [official code]
- SAAT: Syntax-Aware Action Targeting for Video Captioning (CVPR'20) [paper] [official code]
If you find our work useful in your research, please consider citing:
@InProceedings{yan2022GL-RG,
  title={GL-RG: Global-Local Representation Granularity for Video Captioning},
  author={Liqi Yan and Qifan Wang and Yiming Cui and Fuli Feng and Xiaojun Quan and Xiangyu Zhang and Dongfang Liu},
  booktitle={IJCAI},
  year={2022}
}