/hgr_v2t

Code accompanying the paper "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning".

Primary LanguagePythonMIT LicenseMIT

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

This repository contains PyTorch implementation of our paper Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning (CVPR 2020).

Overview of HGR Model

Prerequisites

Python 3 and PyTorch 1.3.

# clone the repository
git clone git@github.com:cshizhe/hgr_v2t.git
cd hgr_v2t
export PYTHONPATH=$(pwd):${PYTHONPATH}

Datasets

We provide annotations, pretrained features on MSRVTT, TGIF, VATEX and Youtube2Text video captioning datasets, which can be downloaded from BaiduNetdisk (code: vxpi).

Annotations

  • groundtruth: annotation/RET directory
  1. ref_captions.json: dict, {videoname: [sent]}
  2. sent2rolegraph.augment.json: {sent: (graph_nodes, graph_edges)}
  • vocabularies: annotation/RET directory int2word.npy: [word] word2int.json: {word: int}

  • data splits: public_split directory trn_names.npy, val_names.npy, tst_names.npy

Features

For MSRVTT, TGIF and Youtube2Text datasets, we extract features with Resnet152 pretrained on ImageNet. For VATEX dataset, we use the I3D features released by VATEX challenge organizers.

  • mean pooling features: ordered_feature/MP directory

format: np array, shape=(num_fts, dim_ft) corresponding to the order in data_split names

  • frame-level features: ordered_feature/SA directory

format: hdf5 file, {name: ft}, ft.shape=(num_frames, dim_ft)

Fine-grained Binary Selection Annotation

We construct the fine-grained binary selection dataset based on the testing set of Youtube2Text dataset. The annotations are in the Youtube2Text/annotation/binary_selection directory.

Training & Inference

Semantic Graph Construction

We provided constructed role graph annotations. If you want to generate role graphs for new datasets, please follow the following instructions.

  1. semantic role labeling:
python misc/semantic_role_labeling.py ref_caption_file out_file --cuda_device 0
  1. convert sentence into role graph:
cd misc
jupyter notebook
# open parse_sent_to_role_graph.ipynd

Training and Evaluation

  1. The baseline VSE++ model:
cd t2vretrieval/driver

# setup config files
# you should modify data paths in configs/prepare_globalmatch_configs.py
python configs/prepare_globalmatch_cofig.py $datadir
resdir='' # copy the output string of the previous step

# training
python global_match.py $resdir/model.json $resdir/path.json --is_train --resume_file $resdir/../../word_embeds.glove42b.th

# inference
python global_match.py $resdir/model.json $resdir/path.json --eval_set tst
  1. Our HGR model:
cd t2vretrieval/driver

# setup config files
# you should modify data paths in configs/prepare_mlmatch_configs.py
python configs/prepare_mlmatch_configs.py $datadir
resdir='' # copy the output string of the previous step

# training
python multilevel_match.py $resdir/model.json $resdir/path.json --load_video_first --is_train --resume_file $resdir/../../word_embeds.glove42b.th

# inference
python multilevel_match.py $resdir/model.json $resdir/path.json --load_video_first --eval_set tst

Citations

If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:

@article{chen2020fine,
  title={Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning},
  author={Chen, Shizhe and Zhao, Yida and Jin, Qin and Wu, Qi},
  journal={CVPR},
  year={2020}
}

License

MIT License