
Performance of "Learning to Discretely Compose Reasoning Module Networks for Video Captioning" (IJCAI 2020) on Hindi description generation



Learning to Discretely Compose Reasoning Module Networks for Video Captioning (IJCAI2020)


This code is the PyTorch implementation of RMN, forked from tgc1997. The Hindi MSR-VTT dataset was created by alokssingh. The original code has been modified for compatibility with Hindi text. This implementation of RMN is used as a baseline model.


  • Python 3.7.3 (other versions may also work)
  • PyTorch 1.4.0 (other versions may also work)
  • pickle
  • tqdm
  • h5py
  • matplotlib
  • numpy
  • tensorboard_logger
  • CUDA 10.1
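
A quick way to confirm that the packages listed above are importable before training is sketched below. The import names are assumptions (for example, `torch` for PyTorch), and `missing_packages` is a hypothetical helper, not part of this repository.

```python
import importlib.util
import sys

# Import names assumed for the packages listed above ("torch" for PyTorch).
REQUIRED = ["torch", "pickle", "tqdm", "h5py", "matplotlib", "numpy", "tensorboard_logger"]


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

The check only looks for the packages on the import path. It does not verify versions or the CUDA installation.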
  1. Download the visual features from MSR-VTT and the text features from MSR-VTT-Hindi-text, and put them in the data folder.
  2. Download the evaluation tool from caption-eval.

Training command example:

python train.py --dataset=msr-vtt --model=RMN --result_dir=results/msr-vtt_model --use_lin_loss \
 --learning_rate_decay --learning_rate_decay_every=5 --learning_rate_decay_rate=3 \
 --use_loc --use_rel --use_func --use_multi_gpu --learning_rate=1e-4 --attention=gumbel \
 --hidden_size=1300 --att_size=1024 --train_batch_size=32 --test_batch_size=8
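
The decay flags in the command above suggest a step schedule. The sketch below assumes that the learning rate is divided by `--learning_rate_decay_rate` once every `--learning_rate_decay_every` epochs; this reading is an assumption, `decayed_learning_rate` is a hypothetical helper, and `train.py` may implement the schedule differently.

```python
def decayed_learning_rate(base_lr, epoch, decay_every=5, decay_rate=3.0):
    """Step decay (assumed): divide base_lr by decay_rate once per completed
    block of decay_every epochs."""
    return base_lr / (decay_rate ** (epoch // decay_every))


# With the values from the command above: 1e-4, decayed every 5 epochs by a factor of 3.
for epoch in (0, 4, 5, 10):
    print(epoch, decayed_learning_rate(1e-4, epoch))
```

Under this assumption the rate stays at 1e-4 for epochs 0 to 4, drops to one third of that at epoch 5, and to one ninth at epoch 10.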

Evaluation command example:

python evaluate.py --dataset=msr-vtt --model=RMN --result_dir=results/msr-vtt_model \
 --use_loc --use_rel --use_func --hidden_size=1300 --att_size=1024 \
 --test_batch_size=2 --beam_size=2 --eval_metric=CIDEr
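
The `--beam_size` flag controls how many partial captions are kept at each decoding step. The following is a simplified, generic beam search over a toy next-word table, meant only to illustrate the idea. `beam_search`, `TOY_MODEL`, and `toy_next` are hypothetical names, and the decoder in `evaluate.py` may differ.

```python
import math


def beam_search(next_log_probs, start, end, beam_size=2, max_len=10):
    """Return the highest-scoring token sequence found by a simple beam search.

    next_log_probs(prefix) must return a dict that maps each candidate
    next token to its log-probability given the prefix.
    """
    beams = [([start], 0.0)]  # unfinished hypotheses: (tokens, total log-prob)
    finished = []             # hypotheses that ended with the end token
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        if not candidates:
            break
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses if max_len is hit
    return max(finished, key=lambda c: c[1])[0]


# Toy next-word distribution, conditioned only on the previous token (illustrative).
TOY_MODEL = {
    "<s>": {"a": math.log(0.6), "the": math.log(0.4)},
    "a": {"man": math.log(0.5), "dog": math.log(0.5)},
    "the": {"man": math.log(0.9), "dog": math.log(0.1)},
    "man": {"</s>": 0.0},
    "dog": {"</s>": 0.0},
}


def toy_next(prefix):
    return TOY_MODEL[prefix[-1]]


print(beam_search(toy_next, "<s>", "</s>", beam_size=1))
print(beam_search(toy_next, "<s>", "</s>", beam_size=2))
```

In this toy setting, a beam of size 1 (greedy decoding) commits to "a" at the first step, while a beam of size 2 keeps "the" alive and ends up with the more probable sequence.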

NOTE: For the METEOR score, we used meteor_indic, with indic_tokenizer for tokenization.
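
To illustrate why Hindi-aware tokenization matters for scoring, here is a minimal regex-based sketch that separates the danda (।) from the preceding word. It is not the indic_tokenizer implementation or its API; `simple_hindi_tokenize` and `TOKEN_PATTERN` are hypothetical names used only for this example.

```python
import re

# Devanagari block (U+0900 to U+097F) without the danda (U+0964) and double
# danda (U+0965), which are sentence punctuation and become separate tokens.
# Latin letters and ASCII digits are grouped; any other non-space character
# is emitted as a single-character token.
TOKEN_PATTERN = re.compile(r"[\u0900-\u0963\u0966-\u097F]+|[A-Za-z0-9]+|\S")


def simple_hindi_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)


print(simple_hindi_tokenize("एक आदमी गा रहा है।"))
```

A plain whitespace split would leave the danda attached to the last word, so a matching word in a reference caption would not count as an exact match.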


  1. Learning to Discretely Compose Reasoning Module Networks for Video Captioning
  2. Attention based video captioning framework for Hindi
  3. alokssingh
  4. tgc1997