MVAN-VisDial

PyTorch Implementation of Multi-View Attention Networks for Visual Dialog


Multi-View Attention Networks for Visual Dialog

Implements the model described in the following paper: Multi-View Attention Networks for Visual Dialog.

@article{park2020multi,
  title={Multi-View Attention Networks for Visual Dialog},
  author={Park, Sungjin and Whang, Taesun and Yoon, Yeochan and Lim, Heuiseok},
  journal={arXiv preprint arXiv:2004.14025},
  year={2020}
}

This code is reimplemented as a fork of batra-mlp-lab/visdial-challenge-starter-pytorch and yuleiniu/rva.

Setup and Dependencies

This code is implemented using PyTorch v1.3.1 and provides out-of-the-box support for CUDA 10 and cuDNN 7.
Anaconda / Miniconda is recommended for setting up this codebase.

Anaconda or Miniconda

Clone this repository and create an environment:

git clone https://www.github.com/taesunwhang/MVAN-VisDial
conda create -n mvan_visdial python=3.7

# activate the environment and install all dependencies
conda activate mvan_visdial
cd MVAN-VisDial/
pip install -r requirements.txt

Data

  1. Download the VisDial v0.9 and v1.0 dialog json files from here and keep them under the $PROJECT_ROOT/data/v0.9 and $PROJECT_ROOT/data/v1.0 directories, respectively.
  2. batra-mlp-lab provides the word counts for the VisDial v1.0 train split (visdial_1.0_word_counts_train.json), which are used to build the vocabulary. Keep the file under the $PROJECT_ROOT/data/v1.0 directory.
  3. If you wish to use preprocessed textual inputs, we provide preprocessed data here; keep it under the $PROJECT_ROOT/data/visdial_1.0_text directory.
  4. For pre-extracted image features of VisDial v1.0 images, batra-mlp-lab provides Faster R-CNN image features pre-trained on Visual Genome. Keep them under the $PROJECT_ROOT/data/visdial_1.0_img directory and set the img_feature_type argument to faster_rcnn_x101 in the config/hparams.py file (see the sketch after this list).
     1. Alternatively, gicheonkang provides pre-extracted Faster R-CNN image features that include bounding box information. To use them, set the img_feature_type argument to dan_faster_rcnn_x101 in the config/hparams.py file.
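For reference, these settings live alongside the other hyperparameters in config/hparams.py. The snippet below is only an illustrative sketch: img_feature_type, load_pthpath, and save_dirpath are the arguments named in this README, while how they are grouped in the file is an assumption.

# Illustrative sketch of the relevant entries in config/hparams.py.
# Only the argument names below appear in this README; the surrounding
# structure of the file is an assumption.
img_feature_type = "faster_rcnn_x101"   # or "dan_faster_rcnn_x101" for the
                                        # gicheonkang features with bounding boxes
load_pthpath = ""                       # "/path/to/checkpoint.pth" to warm-start training
save_dirpath = "checkpoints/"           # where per-epoch checkpoints are written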

Initializing GloVe Word Embeddings

Download the GloVe pretrained word vectors from here, and keep glove.6B.300d.txt under the $PROJECT_ROOT/data/word_embeddings/glove directory.
Simply run

python data/preprocess/init_glove.py
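Conceptually, this step builds an embedding matrix by looking up each word of the VisDial vocabulary in glove.6B.300d.txt. The sketch below illustrates the idea; the function and variable names are illustrative, not the script's actual API, and the real logic in data/preprocess/init_glove.py may differ in details.

import numpy as np

# Illustrative sketch of GloVe initialization (not the actual script).
def build_glove_matrix(glove_path, vocab, dim=300):
    vectors = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            token, *values = line.rstrip().split(" ")
            vectors[token] = np.asarray(values, dtype=np.float32)
    # Words missing from GloVe fall back to small random vectors.
    matrix = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    for idx, word in enumerate(vocab):
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix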

Model Training

Train the model provided in this repository as follows:

python main.py --model mvan --version 1.0

Initializing with Pre-trained Model

Set the load_pthpath argument to /path/to/checkpoint.pth in the config/hparams.py file.
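Warm-starting amounts to restoring the saved state dicts before training begins. The sketch below assumes the checkpoint stores "model" and "optimizer" state dicts; those key names are an assumption about this repository's saving format, not a documented API.

import torch

def load_pretrained(model, optimizer, load_pthpath):
    # Rough sketch of warm-starting from load_pthpath; the checkpoint keys
    # ("model", "optimizer") are an assumption about the saved format.
    checkpoint = torch.load(load_pthpath, map_location="cpu")
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    return model, optimizer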

Saving model checkpoints

Model checkpoints are saved after every epoch. Set the save_dirpath argument in the config/hparams.py file.
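For reference, per-epoch saving typically boils down to a torch.save call into save_dirpath. The sketch below is illustrative only; the file naming and checkpoint keys are assumptions, not this repository's exact format.

import os
import torch

def save_checkpoint(save_dirpath, epoch, model, optimizer):
    # Illustrative per-epoch checkpoint saving; file naming and keys are assumptions.
    os.makedirs(save_dirpath, exist_ok=True)
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
        os.path.join(save_dirpath, "checkpoint_%d.pth" % epoch),
    )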

Evaluation

Evaluation of a trained model checkpoint can be done as follows:

python evaluate.py --model mvan --evaluate /path/to/checkpoint.pth --eval_split val

If you wish to evaluate the model on the test split, replace --eval_split val with --eval_split test.
This will generate a json file for each split, which is evaluated on various metrics: mean reciprocal rank (MRR), R@{1, 5, 10}, mean rank, and normalized discounted cumulative gain (NDCG).
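The sparse metrics are all functions of the rank assigned to the ground-truth answer among the 100 answer candidates. The sketch below shows how they can be computed from those ranks; NDCG additionally requires the dense relevance annotations and is omitted here.

import numpy as np

def sparse_metrics(gt_ranks):
    # gt_ranks: 1-based rank of the ground-truth answer for each question.
    gt_ranks = np.asarray(gt_ranks, dtype=np.float64)
    return {
        "mrr": float(np.mean(1.0 / gt_ranks)),
        "r@1": float(np.mean(gt_ranks <= 1)),
        "r@5": float(np.mean(gt_ranks <= 5)),
        "r@10": float(np.mean(gt_ranks <= 10)),
        "mean_rank": float(np.mean(gt_ranks)),
    }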
To evaluate on the test split, use the evaluation server provided by EvalAI.