Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules [CVPR2021]


Grounding in VQA has recently received increased attention in the research community. Most approaches rely on pretrained object detectors, which require bounding box annotations for every object in the vocabulary; such annotations may not be feasible for real-life, large-scale applications. In this paper, we focus on a more relaxed setting: grounding relevant visual entities in a weakly supervised manner by training on the VQA task alone. To address this problem, we propose a visual capsule module with a query-based selection mechanism over capsule features, which allows the model to focus on relevant regions based on the textual cues about visual information in the question. We show that integrating the proposed capsule module into existing VQA systems significantly improves their performance on the weakly supervised grounding task. We demonstrate the effectiveness of our approach on two state-of-the-art VQA systems, stacked NMN and MAC, on the CLEVR-Answers benchmark (our new evaluation set based on CLEVR scenes, with ground-truth bounding boxes for the objects relevant to the correct answer) and on GQA, a real-world VQA dataset with compositional questions. Systems with the proposed capsule module consistently outperform the respective baselines in answer grounding while achieving comparable performance on the VQA task.

[Paper] [Supplementary] [Presentation Video] [Poster]

Qualitative Results



We use TensorFlow 1.15.0, CUDA 10.1, and Python 3.6.12 for our experiments.

We recommend creating a conda environment to install the libraries. Follow the instructions in the SNMN and MAC code repos to set up the respective environments.


for MAC

First, clone this project repo.

git clone

Go to root directory.

cd WeakGroundedVQA_Capsules

For mac-capsules, go to mac-capsules directory.

cd mac-capsules

The mac-capsules/requirements.txt file lists the conda environment packages used for MAC-Capsules.

Inside the mac-capsules directory, run the following to create a new environment named "tf_gpu15".

conda create --name tf_gpu15 tensorflow-gpu=1.15
conda activate tf_gpu15
pip install -r requirements.txt

We build upon SNMN and MAC and thank their authors for providing excellent code repos.


We use two datasets in this work: GQA and CLEVR-Answers.


CLEVR-Answers is an extended version of the CLEVR dataset for evaluating answer grounding. We used the CLEVR dataset generation framework to generate new questions with bounding box labels for the answers. Each data sample now consists of a question, an image, an answer label, and bounding box labels for the answer objects. We provide these labels for the CLEVR training and validation sets. We call this dataset CLEVR-Answers; it can be downloaded from here.

Following is the file structure for CLEVR-Answers:


To have a standard train-val-test setup, we hold out 1K training images with 10K question-answer pairs for hyperparameter validation. We call this set "train-val". The original validation set is used as the test set in all our experiments.
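A held-out set like this is typically built at the image level, so that every question about a held-out image leaves the training set together. The following is a minimal sketch of that idea only, not the script that produced the released split. It assumes each question record carries an `image_index` key, as in the CLEVR question files; the function name `split_by_image`, the seed, and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def split_by_image(questions, num_val_images, seed=0):
    """Hold out all questions that belong to `num_val_images` randomly chosen images."""
    # Group questions by the image they refer to (assumed key: "image_index").
    by_image = defaultdict(list)
    for q in questions:
        by_image[q["image_index"]].append(q)

    # Sample the held-out images deterministically for a given seed.
    rng = random.Random(seed)
    val_ids = set(rng.sample(sorted(by_image), num_val_images))

    new_train = [q for q in questions if q["image_index"] not in val_ids]
    train_val = [q for q in questions if q["image_index"] in val_ids]
    return new_train, train_val


# Toy data (made up): 5 images with 2 questions each.
toy = [{"image_index": i // 2, "question": f"q{i}"} for i in range(10)]
new_train, train_val = split_by_image(toy, num_val_images=1)
print(len(new_train), len(train_val))  # 8 2
```

Splitting by image rather than by question keeps the two sets disjoint in terms of scenes, which is what makes the held-out questions useful for tuning hyperparameters.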

The split of training data into "new-train" and "train-val" is provided here. todo: add file format description


The GQA dataset can be downloaded from here. We used the balanced version of GQA for our experiments. GQA provides bounding box annotations for both question and answer objects. We evaluate grounding on this dataset against four types of grounding ground truth: question (Q), full answer (FA), short answer (A), and both question and answer objects (All). The bounding box information for each ground-truth type is saved in the same format as CLEVR-Answers. These files can be downloaded from this link.

Following is the file structure for GQA:


Format description for the grounding bounding box files such as gqa_val_question_question2bboxes.json

Files named gqa_val_<grounding_label_type>_question2bboxes.json contain the ground-truth object boxes saved for each qid. grounding_label_type can be one of the following: all, question, answer, full_answer. It selects which objects grounding is evaluated against. For more details, see the caption of Table 2 in the main paper.
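The naming convention above can be captured in a small helper. This is an illustrative sketch: the function name `bbox_file_name` and the constant `GROUNDING_LABEL_TYPES` are hypothetical and not part of the released code; only the file name pattern and the four label types come from the description above.

```python
# The four grounding label types listed above.
GROUNDING_LABEL_TYPES = ("all", "question", "answer", "full_answer")


def bbox_file_name(grounding_label_type):
    """Build the ground-truth box file name for one grounding label type."""
    if grounding_label_type not in GROUNDING_LABEL_TYPES:
        raise ValueError(
            f"unknown grounding_label_type {grounding_label_type!r}; "
            f"expected one of {GROUNDING_LABEL_TYPES}"
        )
    return f"gqa_val_{grounding_label_type}_question2bboxes.json"


print(bbox_file_name("question"))  # gqa_val_question_question2bboxes.json
```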

These files are obtained by processing the GQA questions and scene graph information, and use the following format:

{qid1: {obj_id1: [x1, y1, w, h],
        obj_id2: [x1, y1, w, h],
        ...},
 qid2: {...},
 ...}
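A minimal sketch of reading this format and converting each [x1, y1, w, h] box to corner form [x1, y1, x2, y2], which is often more convenient for overlap computations. The helper names `xywh_to_xyxy` and `load_boxes` are hypothetical, and the ids and coordinates in `sample` are made up for illustration; to read a real file, pass its contents to `load_boxes` instead.

```python
import json

# Made-up example in the format described above (JSON keys are strings).
sample = '{"qid1": {"obj1": [10, 20, 30, 40], "obj2": [0, 0, 5, 5]}, "qid2": {}}'


def xywh_to_xyxy(box):
    """Convert [x1, y1, w, h] to [x1, y1, x2, y2]."""
    x1, y1, w, h = box
    return [x1, y1, x1 + w, y1 + h]


def load_boxes(text):
    """Parse a question2bboxes JSON string into {qid: {obj_id: [x1, y1, x2, y2]}}."""
    raw = json.loads(text)
    return {
        qid: {obj_id: xywh_to_xyxy(box) for obj_id, box in objs.items()}
        for qid, objs in raw.items()
    }


boxes = load_boxes(sample)
print(boxes["qid1"]["obj1"])  # [10, 20, 40, 60]
```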


We integrate our capsule module into two baselines: SNMN and MAC. The MAC network was trained on both the CLEVR-Answers and GQA datasets.


Code for MAC-Caps is shared under the mac-capsules directory. We report our best results on GQA with 32 capsules.

Download the GQA features:

cd data
cd ../
python --name spatial 

Download data for GQA balanced split and copy it under the mac-capsules/data/ folder:

cd data
cd ../

Download the GQA data files from here and copy them into the mac-capsules/data/ folder.


Run the following command to start training MAC-Capsules with 32 capsules and network length 4 on the GQA dataset:

python --expName "gqaExperiment-Spatial-32-capsules-4t" --train --testedNum 10000  --epochs 25 --netLength 4 @configs/gqa/gqa_spatial.txt --writeDim 544   --NUM_VIS_CAPS_L1 32 --NUM_VIS_CAPS_L2 32

--writeDim depends on the number of capsules. --NUM_VIS_CAPS_L1 sets the number of primary capsules, and --NUM_VIS_CAPS_L2 sets the number of visual capsules. In all experiments, we use the same number of capsules in the primary layer and the visual capsule layer, i.e., NUM_VIS_CAPS_L1 == NUM_VIS_CAPS_L2 == C. --writeDim is therefore C x (K x K + 1), where each capsule has a pose matrix of size K x K and the additional 1 is its activation. The values below correspond to K = 4 (K x K + 1 = 17):

for C=16, --writeDim=16x17=272
for C=24, --writeDim=24x17=408
for C=32, --writeDim=32x17=544
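The same calculation as a small helper, for capsule counts other than the three listed. This is a sketch: the function name `write_dim` is hypothetical, and the default `pose_dim=4` is inferred from the factor 17 = 4 x 4 + 1 used in the values above.

```python
def write_dim(num_capsules, pose_dim=4):
    """Value for --writeDim: C x (K x K + 1).

    Each capsule contributes a K x K pose matrix plus one activation.
    """
    return num_capsules * (pose_dim * pose_dim + 1)


for c in (16, 24, 32):
    print(f"C={c}: --writeDim {write_dim(c)}")
# C=16: --writeDim 272
# C=24: --writeDim 408
# C=32: --writeDim 544
```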


python --expName "gqaExperiment-Spatial-32-capsules-4t" --finalTest --test --testAll --netLength 4 -r --getPreds --getAtt @configs/gqa/gqa_spatial.txt --writeDim 544 --NUM_VIS_CAPS_L1 32 --NUM_VIS_CAPS_L2 32

Testing on custom dataset

Follow instructions here to test on your custom dataset.

Grounding Evaluation

To generate detections from attention maps produced by the MAC network or MAC-Caps, follow the instructions here.

Todo:

  • Grounding evaluation code
  • Instructions for MAC-Capsules-clevrAnswers
  • SNMN-Capsules
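Until the evaluation code is available, the general idea of turning an attention map into a box and scoring it against a ground-truth box can be sketched as follows. This is a generic illustration and not necessarily the procedure described in the linked instructions: the min-max normalization, the 0.5 threshold, the single enclosing box, and the names `attention_to_box` and `iou_xywh` are all assumptions. Boxes use the [x1, y1, w, h] convention and are expressed in attention-grid cells, so they would still need scaling to image coordinates.

```python
def attention_to_box(att, threshold=0.5):
    """Return [x1, y1, w, h] enclosing all cells whose min-max normalized
    attention is >= threshold, or None if the map is constant."""
    flat = [v for row in att for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return None
    cells = [
        (x, y)
        for y, row in enumerate(att)
        for x, v in enumerate(row)
        if (v - lo) / (hi - lo) >= threshold
    ]
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    x1, y1 = min(xs), min(ys)
    return [x1, y1, max(xs) - x1 + 1, max(ys) - y1 + 1]


def iou_xywh(a, b):
    """Intersection over union of two [x1, y1, w, h] boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


# Made-up 4x4 attention map with a 2x2 hot region in the centre.
att = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 1.0, 0.0],
    [0.0, 0.8, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
box = attention_to_box(att)
print(box)                          # [1, 1, 2, 2]
print(iou_xywh(box, [1, 1, 2, 2]))  # 1.0
```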


If this work and/or dataset is useful for your research, please cite our paper.

    author    = {Urooj, Aisha and Kuehne, Hilde and Duarte, Kevin and Gan, Chuang and Lobo, Niels and Shah, Mubarak},
    title     = {Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {8465-8474}


Please contact ''