Code and data for the paper Global-shared Text Representation based Multi-Stage Fusion Transformer Network for Multi-modal Dense Video Captioning
NOTE: This repo is still under construction.
Please cite the following paper if you use this repository in your research.
Under construction
- sh run_caption.sh
#configure the training
DATA_SELECT=2000 # SELECT 2000 or 9000
- sh run_proposal.sh
- pretrained_cap_path=./checkpoint/train_cap/{caption_file_path}/best_cap_model.pt
- prop_result_path=./log/train_prop/{propsal_file_path}/submissions/prop_results_val_1_e{best_epoch}_maxprop100.json
- sh run_eval_props.sh