
Multiple interaction learning with question-type prior knowledge for constraining answer search space in visual question answering (ECCVW 2020)


Multiple interaction learning with question-type prior knowledge for constraining answer search space in visual question answering

This repository contains the implementation of "Multiple interaction learning with question-type prior knowledge for constraining answer search space in visual question answering" (MILQT) for the visual question answering (VQA) task. Our single model achieves 70.93 on VQA 2.0 test-standard. On the TDIUC dataset, our single model achieves 73.04 on the Arithmetic MPT metric and 66.86 on the Harmonic MPT metric.
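For readers unfamiliar with the two TDIUC metrics, the sketch below shows how an arithmetic and a harmonic mean over per-question-type accuracies differ. The question types and accuracy values are made up for illustration; they are not results from the paper, and the function names are hypothetical rather than part of this repository.

```python
from statistics import harmonic_mean, mean

# Hypothetical per-question-type accuracies in percent (illustrative only).
per_type_accuracy = {
    "counting": 60.0,
    "color": 75.0,
    "positional_reasoning": 40.0,
}


def arithmetic_mean_per_type(accuracies):
    """Unweighted average of the per-type accuracies."""
    return mean(list(accuracies.values()))


def harmonic_mean_per_type(accuracies):
    """Harmonic mean of the per-type accuracies; weak types pull it down more."""
    return harmonic_mean(list(accuracies.values()))


if __name__ == "__main__":
    print("arithmetic:", round(arithmetic_mean_per_type(per_type_accuracy), 2))
    print("harmonic:  ", round(harmonic_mean_per_type(per_type_accuracy), 2))
```

Because the harmonic mean is never larger than the arithmetic mean, a gap between the two scores indicates that accuracy is uneven across question types.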

This repository is based on and inspired by @hengyuan-hu's work and @kim's work. We sincerely thank them for sharing their code.

Summary

The proposed framework

Overview of MILQT

Prerequisites

You may need a machine with 1 GPU, at least 11GB of memory, and PyTorch v0.4.1 for Python 3.6.

  1. Install PyTorch with CUDNN v7.1, CUDA 9.2 and Python 3.6.
  2. Install h5py.
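A quick way to confirm the prerequisites is a small check script like the sketch below. It is not part of this repository: the function name and messages are hypothetical, and it assumes that the 11GB requirement refers to GPU memory, which the text above does not state explicitly.

```python
import importlib.util
import sys


def check_environment(python=(3, 6), torch_version="0.4.1", min_gpu_gb=11):
    """Return a list of warnings; an empty list means the setup matches the README."""
    problems = []
    found = sys.version_info[:2]
    if found != python:
        problems.append(
            f"Python {found[0]}.{found[1]} found, README targets {python[0]}.{python[1]}"
        )
    if importlib.util.find_spec("h5py") is None:
        problems.append("h5py is not installed")
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
        return problems

    import torch  # imported lazily so the check still runs without PyTorch

    if not torch.__version__.startswith(torch_version):
        problems.append(
            f"PyTorch {torch.__version__} found, README targets {torch_version}"
        )
    if not torch.cuda.is_available():
        problems.append("no CUDA-capable GPU is visible to PyTorch")
    else:
        # Assumption: the 11GB requirement is about GPU memory (decimal GB).
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if total_gb < min_gpu_gb:
            problems.append(f"GPU 0 has {total_gb:.1f}GB, README suggests {min_gpu_gb}GB")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment matches the README"]:
        print(line)
```

A newer Python or PyTorch version only produces a warning here; whether the code still runs on it is not something this check can tell you.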
Installing the libraries required by MILQT

Python3

Please install the dependencies by running the following command:

pip install -r requirements.txt

Preprocessing

All data should be downloaded to a data/ directory in the root directory of this repository.

The easiest way to download the data is to run the provided script tools/download.sh from the repository root. If the script does not work, it should be easy to examine the script and modify the steps outlined in it according to your needs. Then run tools/process.sh from the repository root to process the data to the correct format.

To reach its best performance, our model requires a Mixture of Detection (MoD) features from Faster R-CNN and FPN as input image features. These image features can be found here and should be extracted and placed in data/MoD/. Our implementation also uses the pretrained features from bottom-up-attention. The image features introduced above have 10-100 adaptive features per image.

For now, you need to manually download the items below (used in our best single model).

We use a part of the Visual Genome dataset for data augmentation. The image metadata needs to be placed in data/.

We use MS COCO captions to extract semantically connected words for the extended word embeddings along with the questions of VQA 2.0 and Visual Genome. You can download them here.

The counting module (Zhang et al., 2018) is integrated into this repository as counting.py for your convenience. The source repository can be found at @Cyanogenoid's vqa-counting.
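Before moving on to training, it can help to verify that the directories named in this section exist. The helper below is a minimal sketch and not part of the repository; it only knows about data/ and data/MoD/, since those are the only locations this README names, and the function name is hypothetical.

```python
from pathlib import Path

# Directories named in this README; extend the tuple for your own setup.
EXPECTED_DIRS = ("data", "data/MoD")


def missing_dirs(repo_root=".", expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in expected if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("missing directories:", ", ".join(missing))
    else:
        print("all expected directories are present")
```

Run it from the repository root, or pass the root path as the first argument to missing_dirs.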

Training

$ python3 main.py --use_MoD --MoD_dir data/MoD/ --batch_size 64 --update_freq 4 --lr 7e-4 --comp_attns BAN_COUNTER,BAN,SAN --output saved_models/MILQT --use_counter --use_both --use_vg

Run the command above to start training (the --use_both and --use_vg options add the train/val splits and Visual Genome to the training data, respectively). The training scores are printed every epoch, and the best model is saved under the directory "saved_models". The default hyper-parameters should give you the best single-model result, which is around 70.62 on the test-dev split.
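The command combines --batch_size 64 with --update_freq 4. This README does not define --update_freq; if it means that parameters are updated once every 4 batches (gradient accumulation), which is an assumption you should confirm in main.py, the idea can be sketched as follows. The function is a hypothetical illustration with scalar gradients, not the repository's training loop.

```python
def accumulate(micro_batch_grads, update_freq):
    """Average gradients over groups of `update_freq` micro-batches.

    Returns one averaged gradient per parameter update. Assumption: this is
    how --update_freq behaves; the README itself does not say so.
    """
    updates = []
    buffer = 0.0
    for step, grad in enumerate(micro_batch_grads, start=1):
        buffer += grad / update_freq  # scale so the group sum is an average
        if step % update_freq == 0:
            updates.append(buffer)  # an optimizer step would happen here
            buffer = 0.0
    return updates


if __name__ == "__main__":
    # Eight micro-batches with update_freq=4 lead to two parameter updates.
    print(accumulate([1, 2, 3, 4, 5, 6, 7, 8], update_freq=4))
```

Under that reading, each parameter update would see 64 × 4 = 256 examples.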

Validation

If you trained a model with the training split using

$ python3 main.py --use_MoD --MoD_dir data/MoD/ --batch_size 64 --update_freq 4 --lr 7e-4 --comp_attns BAN_COUNTER,BAN,SAN --output saved_models/MILQT --use_counter

then you can run evaluate.py with appropriate options to evaluate its score for the validation split.

Pretrained model

We provide the pretrained model reported as the best single model in the paper (70.62 for test-dev, 70.93 for test-standard).

Please download the pretrained model and move it to saved_models/MILQT/model_epoch12.pth. The training log can be found here. Then run the test script:

$ python3 test.py --use_MoD --MoD_dir data/MoD/ --batch_size 64 --comp_attns BAN_COUNTER,BAN,SAN --input saved_models/MILQT --use_counter

The resulting JSON file is written to the results/ directory.
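If you plan to submit the file to the VQA evaluation server, a quick format check can catch problems early. The sketch below assumes the file follows the standard VQA result format, a JSON list of objects with "question_id" and "answer" keys; this README does not confirm that layout, and the function names are hypothetical.

```python
import json

# Keys used by the standard VQA result format (assumed, not confirmed here).
REQUIRED_KEYS = {"question_id", "answer"}


def find_format_problems(results):
    """Return a list of format problems; an empty list means the data looks fine."""
    if not isinstance(results, list):
        return ["top-level JSON value is not a list"]
    problems = []
    for i, entry in enumerate(results):
        if not isinstance(entry, dict):
            problems.append(f"entry {i} is not an object")
        elif not REQUIRED_KEYS.issubset(entry):
            missing = sorted(REQUIRED_KEYS - set(entry))
            problems.append(f"entry {i} is missing {missing}")
    return problems


def check_result_file(path):
    """Load a result file from disk and check its format."""
    with open(path, encoding="utf-8") as f:
        return find_format_problems(json.load(f))
```

Call check_result_file with the path of the file produced in results/; the exact file name depends on your run.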

Citation

If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:

@misc{do2020multiple,
      title={Multiple interaction learning with question-type prior knowledge for constraining answer search space in visual question answering},
      author={Tuong Do and Binh X. Nguyen and Huy Tran and Erman Tjiputra and Quang D. Tran and Thanh-Toan Do},
      year={2020},
      eprint={2009.11118},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

AIOZ License

More information

AIOZ AI Homepage: https://ai.aioz.io