TVmax

Sparse and Structured Visual Attention

Implementation of the experiments for visual question answering with sparse and structured visual attention.

Requirements

We recommend following the procedure in the official MCAN repository regarding software and hardware requirements. We also use the same setup; see there for how to organize the dataset folders. The only difference is that we additionally use grid features; you can download them from here.

Run

pip install entmax

to install the entmax package.
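The entmax package provides sparsemax and entmax transformations as differentiable PyTorch operations. As a rough illustration of what sparsemax computes — a Euclidean projection of the scores onto the probability simplex, which can assign exactly zero probability to low-scoring entries — here is a NumPy sketch (the package's own implementation is PyTorch-based; this is not its API):

```python
import numpy as np

def sparsemax(z):
    """Sparsemax of a 1-D score vector (Martins & Astudillo, 2016):
    the Euclidean projection of z onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]                # scores in decreasing order
    k = np.arange(1, len(z) + 1)
    # Support size: largest k with 1 + k * z_sorted[k-1] > cumsum(z_sorted)[k-1]
    support = k[1 + k * z_sorted > np.cumsum(z_sorted)]
    k_max = support[-1]
    # Threshold tau so that the clipped scores sum to one
    tau = (np.cumsum(z_sorted)[k_max - 1] - 1) / k_max
    return np.maximum(z - tau, 0.0)

# e.g. sparsemax([2.0, 1.0, 0.1]) -> [1., 0., 0.],
# whereas softmax would keep every entry strictly positive
```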

Training

To train the models in the paper, run this command:

python3 run.py --RUN=train --M='mca' --gen_func=<ATTENTION> --SPLIT=train --features=<FEATURES>

with <ATTENTION>={'softmax', 'sparsemax', 'tvmax'} to train the model with softmax, sparsemax, or TVmax attention, and <FEATURES>={'grid', 'bounding_boxes'} to train with grid features or bounding box features.
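Conceptually, the --gen_func flag swaps the normalizing transformation used inside the attention mechanism while the rest of the model is unchanged. A minimal NumPy sketch of that swap, for single-query scaled dot-product attention (the function and variable names here are illustrative, not the repository's API):

```python
import numpy as np

def softmax(z):
    """Dense attention: every key receives nonzero weight."""
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Sparse attention: projection onto the simplex, can zero out keys."""
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    support = k[1 + k * z_sorted > np.cumsum(z_sorted)]
    k_max = support[-1]
    tau = (np.cumsum(z_sorted)[k_max - 1] - 1) / k_max
    return np.maximum(z - tau, 0.0)

def attend(query, keys, values, gen_func):
    """Scaled dot-product attention with a pluggable normalizer."""
    scores = keys @ query / np.sqrt(len(query))
    weights = gen_func(scores)
    return weights @ values, weights

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 4))
values = rng.normal(size=(5, 4))
query = rng.normal(size=4)

_, w_soft = attend(query, keys, values, softmax)
_, w_sparse = attend(query, keys, values, sparsemax)
# w_soft is strictly positive everywhere; w_sparse may contain exact zeros
```

TVmax additionally encourages the selected weights to form contiguous 2-D regions of the image via a total-variation penalty; its projection is more involved and is not sketched here.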

Evaluation

Evaluation on the VQA 2.0 test-dev and test-std splits is run as follows:

python3 run.py --RUN=test --CKPT_V=<VERSION> --CKPT_E=<EPOCH TO LOAD> --M='mca' --gen_func=<ATTENTION> --features=<FEATURES>

and the result file is stored in ./results/result_test/. The resulting JSON file can be uploaded to EvalAI to obtain scores on the test-dev and test-std splits.
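VQA challenge submissions on EvalAI are, to our understanding, a JSON list of {"question_id", "answer"} entries. A small sanity check along these lines (the expected field names are an assumption based on the standard VQA format; adjust if your result file differs) may save a failed upload:

```python
def validate_vqa_results(results):
    """Check that results look like a VQA-challenge submission:
    a list of {"question_id": int, "answer": str} dicts.
    (Assumed format -- verify against the challenge instructions.)"""
    if not isinstance(results, list):
        return False
    return all(
        isinstance(r, dict)
        and set(r) == {"question_id", "answer"}
        and isinstance(r["question_id"], int)
        and isinstance(r["answer"], str)
        for r in results
    )

# Example usage on an in-memory sample:
sample = [{"question_id": 1, "answer": "two"}]
```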

Citation

@inproceedings{martins2021sparse,
  author    = {Martins, Pedro Henrique and Niculae, Vlad and Marinho, Zita and Martins, Andr{\'e} F. T.},
  title     = {Sparse and Structured Visual Attention},
  booktitle = {Proc. ICIP},
  year      = {2021}
}