Meta-Concepts-for-Video-Captioning

Learn domain-specific meta concepts for video captioning

Cross-Modal Graph With Meta Concepts for Video Captioning

Official PyTorch implementation of:

Cross-Modal Graph With Meta Concepts for Video Captioning
IEEE Transactions on Image Processing (TIP)
Hao Wang, Guosheng Lin, Steven C. H. Hoi, and Chunyan Miao
Paper

Requirements

  • PyTorch 1.2 or higher
  • Python 3.6 or higher
Clone the repository together with its submodules:

git clone --recurse-submodules https://github.com/hwang1996/Meta-Concepts-for-Video-Captioning

Dataset

Please download the MSR-VTT dataset from here to run our code.

Preparation

  • Extract video key frames (a frame-sampling sketch is given after this list)
cd preprocess/
python extract_key_frames.py
  • Use the weakly supervised learning approach to produce meta concepts (a synonym-grouping sketch is given after this list)
cd meta_concept_loc/weakly_learning
python train.py
python generate_synonym.py
python extract_mask.py
  • Train the segmentation model for meta concept inference
cd meta_concept_loc/segmentation
python train_custom.py
python extract_fea.py
  • Please refer to this repo to extract scene graphs
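For the key-frame extraction step above, the snippet below is a minimal sketch of uniform frame sampling with OpenCV. It is illustrative only; the actual extract_key_frames.py may select frames differently, and the function name and default frame count are placeholders.

import os
import cv2
def sample_key_frames(video_path, out_dir, num_frames=8):
    """Uniformly sample num_frames frames from one video and save them as JPEGs."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]  # evenly spaced indices
    for k, idx in enumerate(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, "frame_%03d.jpg" % k), frame)
    cap.release()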

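For the meta-concept step, concept words mined from the captions are grouped by synonymy. The sketch below shows one assumed way to do this with NLTK WordNet (requires a one-time nltk.download('wordnet')); it is not the exact procedure in generate_synonym.py.

from collections import defaultdict
from nltk.corpus import wordnet as wn
def group_synonyms(nouns):
    """Map each noun to the first lemma of its most frequent noun synset."""
    groups = defaultdict(list)
    for noun in nouns:
        synsets = wn.synsets(noun, pos=wn.NOUN)
        canonical = synsets[0].lemmas()[0].name() if synsets else noun
        groups[canonical].append(noun)
    return dict(groups)
# e.g. "car" and "automobile" share the synset car.n.01, so both map to "car"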
Video captioning training

  • Cross-entropy training (a loss sketch is given after this list)
cd captioning
bash run_train.sh
  • Reinforcement learning (a reward sketch is given after this list)
bash run_rl_train.sh
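For reference, cross-entropy training for captioning is ordinary teacher forcing: predict each ground-truth token from the previous ones. The sketch below only illustrates that computation; run_train.sh drives the repo's actual training, and the model call signature here is hypothetical.

import torch.nn.functional as F
def xent_step(model, optimizer, video_feats, captions, pad_idx=0):
    """One teacher-forcing update: predict token t from ground-truth tokens < t."""
    optimizer.zero_grad()
    logits = model(video_feats, captions[:, :-1])   # (batch, T-1, vocab); hypothetical signature
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        captions[:, 1:].reshape(-1),                # targets shifted by one token
        ignore_index=pad_idx,                       # do not penalize padding positions
    )
    loss.backward()
    optimizer.step()
    return loss.item()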

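Reinforcement-learning training for captioning is commonly implemented as self-critical sequence training (SCST), where a sampled caption's CIDEr reward is baselined by the greedy caption's reward. The sketch below illustrates only that loss; run_rl_train.sh drives the repo's actual procedure, and model.sample, model.greedy_decode, and cider_score are hypothetical helpers.

import torch
def scst_loss(model, video_feats, refs, cider_score):
    """REINFORCE loss with a greedy-decoding baseline (SCST-style)."""
    sample_ids, log_probs = model.sample(video_feats)    # stochastic rollout, per-token log-probs
    with torch.no_grad():
        greedy_ids = model.greedy_decode(video_feats)    # deterministic baseline rollout
    reward = cider_score(sample_ids, refs) - cider_score(greedy_ids, refs)  # per-video advantage
    reward = torch.as_tensor(reward, dtype=log_probs.dtype, device=log_probs.device)
    return -(reward.unsqueeze(1) * log_probs).mean()     # maximize reward-weighted log-prob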
Testing

bash run_test.sh

Acknowledgement

Our code builds upon several previous works:

Reference

If you find this repository useful, please cite:

@article{wang2022cross,
  title={Cross-modal graph with meta concepts for video captioning},
  author={Wang, Hao and Lin, Guosheng and Hoi, Steven CH and Miao, Chunyan},
  journal={IEEE Transactions on Image Processing},
  volume={31},
  pages={5150--5162},
  year={2022},
  publisher={IEEE}
}