Code for the IJCAI'22 paper "GL-RG: Global-Local Representation Granularity for Video Captioning".
GL-RG exploits extensive vision representations from different video ranges to improve linguistic expression. We devise a novel global-local encoder to produce a rich semantic vocabulary. With our incremental training strategy, GL-RG successfully leverages the global-local vision representation to achieve fine-grained captioning of video content.
- Python 2.7
- PyTorch 0.2 or 1.0
- Microsoft COCO Caption Evaluation
- CIDEr
- numpy, scikit-image, h5py, requests
This repo was tested with Python 2.7, PyTorch 1.0.1 (or 0.2.0), cuDNN 10.0 (or 6.0), and CUDA 8.0, but it should also run with more recent PyTorch versions >= 1.0 (or >= 0.2 and <= 1.0).
You can use Anaconda or Miniconda to install the dependencies:
conda create -n GL-RG-pytorch python=2.7 pytorch=1.0 scikit-image h5py requests
conda activate GL-RG-pytorch
Or you can install the dependencies from the provided environment file:
conda env create -f environment.yaml
conda activate GL-RG-pytorch
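A quick sanity check like the one below (not part of the repository's scripts) can confirm that the dependencies are importable inside the new environment:

```bash
# Optional sanity check: make sure the core dependencies listed above import
# correctly inside the GL-RG-pytorch environment.
conda activate GL-RG-pytorch
python -c "import torch, h5py, numpy, skimage, requests; print('PyTorch', torch.__version__)"
```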
First, clone this repository to any location using `--recursive`:
git clone --recursive https://github.com/ylqi/GL-RG.git
Check out the `coco-caption/`, `cider/`, `data/` and `model/` projects into your working directory. If they are missing, please follow the detailed steps in INSTALL.md for installation and dataset preparation.
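A quick way to confirm the layout (a small helper sketch, not part of the repository) is to check that the four directories exist before continuing:

```bash
# Check that the working directory contains the projects described above.
for d in coco-caption cider data model; do
  [ -d "$d" ] && echo "found:   $d" || echo "missing: $d (see INSTALL.md)"
done
```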
Then run the following script to download the Stanford CoreNLP 3.6.0 models into `coco-caption/`:
cd coco-caption
./get_stanford_models.sh
Model | Dataset | Exp. | BLEU-4 | METEOR | ROUGE-L | CIDEr | Links |
---|---|---|---|---|---|---|---|
GL-RG | MSR-VTT | XE | 45.5 | 30.1 | 62.6 | 51.2 | GL-RG_XE_msrvtt |
GL-RG | MSR-VTT | DXE | 46.9 | 30.4 | 63.9 | 55.0 | GL-RG_DXE_msrvtt |
GL-RG + IT | MSR-VTT | DR | 46.9 | 31.2 | 65.7 | 60.6 | GL-RG_DR_msrvtt |
GL-RG | MSVD | XE | 55.5 | 37.8 | 74.7 | 94.3 | GL-RG_XE_msvd |
GL-RG | MSVD | DXE | 57.7 | 38.6 | 74.9 | 95.9 | GL-RG_DXE_msvd |
GL-RG + IT | MSVD | DR | 60.5 | 38.9 | 76.4 | 101.0 | GL-RG_DR_msvd |
Check out the trained model weights under the `model/` directory (following Installation) and run:
./test.sh
Note: Please modify `MODEL_NAME`, `EXP_NAME` and `DATASET` in `test.sh` if the experiment setting changes. For more details, please refer to TEST.md.
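For example, to evaluate the DR checkpoint on MSR-VTT from the table above, the three variables would be set along these lines (how `test.sh` declares them is an assumption here; the values follow the model table):

```bash
# Hypothetical excerpt from test.sh -- adjust to the checkpoint you downloaded.
MODEL_NAME=GL-RG_DR_msrvtt   # name of the trained weights under model/
EXP_NAME=DR                  # experiment setting: XE, DXE, or DR
DATASET=msrvtt               # 'msrvtt' or 'msvd'
```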
For the **Seeding Phase** (e.g., using XE):
./train.sh 1 # | 0 - using XE | 1 - using DXE |
For the **Boosting Phase** (e.g., using DR with b1):
./train.sh 3 # | 2 - with SCST baseline | 3 - with b1 baseline | 4 - with b2 baseline |
Note: For higher performance, please increase the batch size using `--batch_size` in `train.sh`. For more variants, set `--start_from` in `train.sh` to determine the Incremental Training entrance model, and set `--use_long_range`, `--use_short_range` and `--use_local` to enable the different global-local features:
- `--use_long_range`: enable long-range features.
- `--use_short_range`: enable short-range features.
- `--use_local`: enable local-keyframe features.
Modify `DATASET` (choices: 'msrvtt', 'msvd') in `train.sh` when switching to the MSR-VTT or MSVD benchmark.
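Putting the options together, a Boosting Phase configuration on MSR-VTT might look like the sketch below; the Python entry point name, the batch size and the `--start_from` path are illustrative assumptions, while the flag names come from the notes above:

```bash
# Hypothetical excerpt from train.sh: Boosting Phase on MSR-VTT.
# --start_from selects the Incremental Training entrance model; the three
# --use_* flags enable long-range, short-range and local-keyframe features.
DATASET=msrvtt                      # 'msrvtt' or 'msvd', read elsewhere in train.sh
python train.py --batch_size 64 \
    --start_from model/GL-RG_DXE_msrvtt \
    --use_long_range --use_short_range --use_local
```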
GL-RG is released under the MIT license.
We are truly thankful for the following prior efforts, both for their knowledge contributions and their open-source repositories.
- SA-LSTM: Describing Videos by Exploiting Temporal Structure (ICCV'15) [paper] [implementation code]
- SCST: Self-critical Sequence Training for Image Captioning (CVPR'17) [paper] [implementation code]
- RecNet: Reconstruction Network for Video Captioning (CVPR'18) [paper] [official code]
- SAAT: Syntax-Aware Action Targeting for Video Captioning (CVPR'20) [paper] [official code]
If you find our work useful in your research, please consider citing:
@InProceedings{yan2022GL-RG,
  title={GL-RG: Global-Local Representation Granularity for Video Captioning},
  author={Liqi Yan and Qifan Wang and Yiming Cui and Fuli Feng and Xiaojun Quan and Xiangyu Zhang and Dongfang Liu},
  booktitle={IJCAI},
  year={2022}
}