The code of IJCAI22 paper "GL-RG: Global-Local Representation Granularity for Video Captioning".

GL-RG: Global-Local Representation Granularity for Video Captioning

Figure: GL-RG framework (framework.png)

GL-RG exploits extensive vision representations from different video ranges to improve linguistic expression. We devise a novel global-local encoder to produce a rich semantic vocabulary. With our incremental training strategy, GL-RG successfully leverages the global-local vision representation to achieve fine-grained captioning of video content.

Dependencies

This repo was tested with Python 2.7, PyTorch 1.0.1 (or 0.2.0), cuDNN 10.0 (or 6.0), and CUDA 8.0. It should also run with more recent PyTorch versions: >=1.0 (or >=0.2 and <=1.0).

You can use Anaconda or Miniconda to install the dependencies:

conda create -n GL-RG-pytorch python=2.7 pytorch=1.0 scikit-image h5py requests
conda activate GL-RG-pytorch

Alternatively, create the environment from the provided environment.yaml file:

conda env create -f environment.yaml
conda activate GL-RG-pytorch
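
Once the environment is active, it can help to confirm that it roughly matches the versions listed above. The following is a minimal sketch and not part of the repository: `parse_version`, `meets_minimum` and `report_environment` are hypothetical helper names, and the only bound checked is the PyTorch lower limit of 0.2 mentioned above.

```python
import re
import sys


def parse_version(text):
    """Convert a version string such as '1.0.1' or '1.6.0+cu101' to a tuple of ints."""
    numbers = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at non-numeric parts such as 'post2'
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.0' and '1.0.0' compare as equal.
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


def report_environment():
    """Collect the Python and (if importable) PyTorch versions of this environment."""
    info = {"python": "{}.{}".format(*sys.version_info[:2]), "torch": None}
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return info
    info["torch"] = torch.__version__
    # Assumption: 0.2 is the lower bound, taken from the note above.
    info["torch_ok"] = meets_minimum(torch.__version__, "0.2")
    info["cuda_available"] = torch.cuda.is_available()
    return info
```

Calling `print(report_environment())` inside the activated environment shows what is installed; a `torch` value of `None` means PyTorch could not be imported.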

Installation

First, clone this repository to any location using --recursive:

git clone --recursive https://github.com/ylqi/GL-RG.git

Check out the coco-caption/, cider/, data/ and model/ projects into your working directory. If any of them is missing, see INSTALL.md for detailed installation and dataset preparation steps.

Then run the following script to download the Stanford CoreNLP 3.6.0 models into coco-caption/:

cd coco-caption
./get_stanford_models.sh
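
A clone made without --recursive can leave some of the directories named above absent or empty. The sketch below, which is not part of the repository, lists the ones that still need attention; `missing_or_empty_dirs` is a hypothetical helper name, and treating an empty directory as "not checked out" is an assumption.

```python
import os

# Directory names taken from the installation step above.
REQUIRED_DIRS = ("coco-caption", "cider", "data", "model")


def missing_or_empty_dirs(repo_root="."):
    """Return the required directories that are absent or empty under repo_root."""
    problems = []
    for name in REQUIRED_DIRS:
        path = os.path.join(repo_root, name)
        if not os.path.isdir(path) or not os.listdir(path):
            problems.append(name)
    return problems
```

Running `missing_or_empty_dirs()` from the repository root returns an empty list when all four directories are populated.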

Datasets

Model Zoo

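| Models | Dataset | Exp. | B@4 | M | R | C | Links |
|---|---|---|---|---|---|---|---|
| GL-RG | MSR-VTT | XE | 45.5 | 30.1 | 62.6 | 51.2 | GL-RG_XE_msrvtt |
| GL-RG | MSR-VTT | DXE | 46.9 | 30.4 | 63.9 | 55.0 | GL-RG_DXE_msrvtt |
| GL-RG + IT | MSR-VTT | DR | 46.9 | 31.2 | 65.7 | 60.6 | GL-RG_DR_msrvtt |
| GL-RG | MSVD | XE | 55.5 | 37.8 | 74.7 | 94.3 | GL-RG_XE_msvd |
| GL-RG | MSVD | DXE | 57.7 | 38.6 | 74.9 | 95.9 | GL-RG_DXE_msvd |
| GL-RG + IT | MSVD | DR | 60.5 | 38.9 | 76.4 | 101.0 | GL-RG_DR_msvd |

B@4, M, R and C denote BLEU-4, METEOR, ROUGE-L and CIDEr.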
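
To make the differences between settings easier to read, the sketch below computes absolute gains from the scores in the table above. The numbers are copied from the table; the `SCORES` layout and the `gain` helper are illustrative names, not part of the repository.

```python
# (dataset, training setting) -> scores copied from the Model Zoo table.
SCORES = {
    ("MSR-VTT", "XE"): {"B@4": 45.5, "M": 30.1, "R": 62.6, "C": 51.2},
    ("MSR-VTT", "DXE"): {"B@4": 46.9, "M": 30.4, "R": 63.9, "C": 55.0},
    ("MSR-VTT", "DR"): {"B@4": 46.9, "M": 31.2, "R": 65.7, "C": 60.6},
    ("MSVD", "XE"): {"B@4": 55.5, "M": 37.8, "R": 74.7, "C": 94.3},
    ("MSVD", "DXE"): {"B@4": 57.7, "M": 38.6, "R": 74.9, "C": 95.9},
    ("MSVD", "DR"): {"B@4": 60.5, "M": 38.9, "R": 76.4, "C": 101.0},
}


def gain(dataset, metric, start="XE", end="DR"):
    """Absolute improvement of setting `end` over `start`, rounded to one decimal."""
    delta = SCORES[(dataset, end)][metric] - SCORES[(dataset, start)][metric]
    return round(delta, 1)


for dataset in ("MSR-VTT", "MSVD"):
    print(dataset, "C gain, XE -> DR:", gain(dataset, "C"))
```

For the last metric column (C), the DR rows improve on the XE rows by 9.4 points on MSR-VTT and 6.7 points on MSVD.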

Test

Check out the trained model weights under the model/ directory (following Installation) and run:

./test.sh

Note: Please modify MODEL_NAME, EXP_NAME and DATASET in test.sh if the experiment setting changes. For more details, please refer to TEST.md.
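
If the three settings need to be switched often, the edit can be scripted. This is a minimal sketch under a stated assumption: that test.sh defines each setting as a plain `NAME=value` line, which is not confirmed here. `set_shell_vars` is a hypothetical helper, and the sample values below mirror the Model Zoo link names only for illustration.

```python
import re


def set_shell_vars(script_text, values):
    """Return script_text with each `NAME=...` assignment replaced by NAME="value".

    Assumes one assignment per line, optionally prefixed by `export`.
    Raises KeyError if a requested variable is not assigned in the script.
    """
    for name, value in values.items():
        pattern = re.compile(
            r"^([ \t]*(?:export[ \t]+)?%s=).*$" % re.escape(name), re.MULTILINE
        )
        quoted = '"' + value + '"'
        # A function replacement avoids backslash escapes being interpreted.
        script_text, count = pattern.subn(lambda m: m.group(1) + quoted, script_text)
        if count == 0:
            raise KeyError("no assignment found for " + name)
    return script_text


# Hypothetical excerpt; the real test.sh may look different.
sample = "MODEL_NAME=GL-RG\nEXP_NAME=XE\nDATASET=msrvtt\n"
print(set_shell_vars(sample, {"EXP_NAME": "DR", "DATASET": "msvd"}))
```

The helper only rewrites text; reading test.sh, writing the result back, and checking it against TEST.md are left to the caller.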

Train

For the Seeding Phase (e.g., using DXE):

./train.sh 1  # | 0 - using XE | 1 - using DXE |

For the Boosting Phase (e.g., using DR with the b1 baseline):

./train.sh 3  # | 2 - with SCST baseline | 3 - with b1 baseline | 4 - with b2 baseline |
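
The numeric argument to train.sh is easy to mix up, so the mapping from the two command comments above can be kept in a small lookup. This is a sketch rather than repository code: `TRAIN_MODES`, `describe` and `train_command` are hypothetical names, and the labels are copied from those comments.

```python
# Mode number -> (phase, setting), as listed in the train.sh comments above.
TRAIN_MODES = {
    0: ("Seeding", "XE"),
    1: ("Seeding", "DXE"),
    2: ("Boosting", "SCST baseline"),
    3: ("Boosting", "b1 baseline"),
    4: ("Boosting", "b2 baseline"),
}


def describe(mode):
    """Return a human-readable label for a train.sh mode number."""
    phase, setting = TRAIN_MODES[mode]
    return "{} Phase: {}".format(phase, setting)


def train_command(mode):
    """Build the argument list for launching train.sh with a valid mode."""
    if mode not in TRAIN_MODES:
        raise ValueError("mode must be one of {}".format(sorted(TRAIN_MODES)))
    return ["./train.sh", str(mode)]


for mode in sorted(TRAIN_MODES):
    print(mode, "->", describe(mode))
```

The list returned by `train_command` can be passed to `subprocess.check_call` from the repository root to start the run.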

Note: For higher performance, increase the batch size with --batch_size in train.sh. To train other variants, set --start_from in train.sh to choose the entrance model for Incremental Training, and set --use_long_range, --use_short_range and --use_local to enable the corresponding global-local features:

  • --use_long_range: enable long-range features.
  • --use_short_range: enable short-range features.
  • --use_local: enable local-keyframe features.
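
For ablation-style runs, the three flags above give seven non-empty feature combinations. The sketch below enumerates them; it is illustrative only, `feature_variants` is a hypothetical helper, and how each flag is actually switched on inside train.sh (bare flag or an explicit value) is not assumed here.

```python
from itertools import combinations

# Flag names taken from the list above.
FEATURE_FLAGS = ("--use_long_range", "--use_short_range", "--use_local")


def feature_variants():
    """Return every non-empty subset of the three feature flags, smallest first."""
    variants = []
    for size in range(1, len(FEATURE_FLAGS) + 1):
        variants.extend(combinations(FEATURE_FLAGS, size))
    return variants


for variant in feature_variants():
    print(" ".join(variant))
```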

Modify DATASET (choices: 'msrvtt', 'msvd') in train.sh when switching between the MSR-VTT and MSVD benchmarks.

License

GL-RG is released under the MIT license.

Acknowledgements

We are truly thankful for the prior efforts, both knowledge contributions and open-source repos, that this work builds on.

Citation

If you find our work useful in your research, please consider citing:

@InProceedings{yan2018GL-RG,
    title={GL-RG: Global-Local Representation Granularity for Video Captioning},
    author={Liqi Yan and Qifan Wang and Yiming Cui and Fuli Feng and Xiaojun Quan and Xiangyu Zhang and Dongfang Liu},
    booktitle={IJCAI},
    year={2022}
}