Data and code for NeurIPS 2021 Paper "IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning".
We propose a new, challenging benchmark, icon question answering (IconQA), which aims to highlight the importance of abstract diagram understanding and comprehensive cognitive reasoning in real-world diagram word problems. For this benchmark, we build a large-scale IconQA dataset that consists of three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. Compared to existing VQA benchmarks, IconQA requires not only perception skills like object recognition and text understanding, but also diverse cognitive reasoning skills, such as geometric reasoning, commonsense reasoning, and arithmetic reasoning.
There are three different sub-tasks in IconQA:
- 57,672 multi-image-choice questions
- 31,578 multi-text-choice questions
- 18,189 filling-in-the-blank questions
| Sub-Tasks | Train | Validation | Test | Total |
|---|---|---|---|---|
| Multi-image-choice | 34,603 | 11,535 | 11,535 | 57,672 |
| Multi-text-choice | 18,946 | 6,316 | 6,316 | 31,578 |
| Filling-in-the-blank | 10,913 | 3,638 | 3,638 | 18,189 |
We further develop a strong model, Patch-TRM, which parses the diagram in a pyramid layout and applies cross-modal Transformers to learn the joint diagram-question feature. Patch-TRM takes patches parsed from the hierarchical pyramid layout and embeds them with a ResNet pre-trained on our Icon645 dataset. The joint diagram-question feature is then learned via cross-modal Transformers followed by an attention module.
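For intuition, here is a minimal, self-contained PyTorch sketch of the Patch-TRM idea: pre-extracted patch features and question token embeddings are projected into a shared space, fused by a Transformer encoder, attention-pooled, and classified over the answer vocabulary. This is a conceptual sketch rather than the repository's implementation; all dimensions, layer counts, and the answer vocabulary size are illustrative assumptions.

```python
# Conceptual sketch of the Patch-TRM idea (NOT the repository implementation).
# Assumptions: patch features are pre-extracted (e.g., 79 x 2048 ResNet pool5
# vectors) and question tokens are already embedded (e.g., by BERT) to 768-d.
import torch
import torch.nn as nn

class PatchTRMSketch(nn.Module):
    def __init__(self, patch_dim=2048, ques_dim=768, d_model=512, num_answers=3000):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)   # project patch features
        self.ques_proj = nn.Linear(ques_dim, d_model)     # project question embeddings
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.attn = nn.Linear(d_model, 1)                 # attention pooling over tokens
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, patch_feats, ques_embs):
        # patch_feats: (B, 79, 2048), ques_embs: (B, L, 768)
        tokens = torch.cat([self.patch_proj(patch_feats),
                            self.ques_proj(ques_embs)], dim=1)
        fused = self.encoder(tokens)                      # cross-modal fusion
        weights = torch.softmax(self.attn(fused), dim=1)  # (B, N, 1)
        pooled = (weights * fused).sum(dim=1)             # attention-weighted sum
        return self.classifier(pooled)                    # answer logits

# Smoke test with random tensors
logits = PatchTRMSketch()(torch.randn(2, 79, 2048), torch.randn(2, 20, 768))
print(logits.shape)  # torch.Size([2, 3000])
```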
For more details, you can find our website here and our paper here.
You can download IconQA here or from Google Drive, then unzip the dataset into root_dir/data.
Next, download the pre-trained models here or from Google Drive, then unzip them into root_dir.
Or run the following command:
. tools/download_data_and_models.sh
Required dependencies:
python=3.6.9
h5py=3.1.0
huggingface-hub=0.0.12
numpy=1.19.5
Pillow=8.3.1
torch=1.9.0+cu111
torchvision=0.10.0+cu111
tqdm=4.61.2
Install all required Python dependencies:
pip install -r requirements.txt
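Optionally, you can verify the environment from Python before proceeding; the expected version strings mirror the list above.

```python
# Optional sanity check for the environment (versions mirror requirements.txt).
import torch, torchvision, h5py, numpy

print("torch:", torch.__version__)             # expect 1.9.0+cu111
print("torchvision:", torchvision.__version__) # expect 0.10.0+cu111
print("numpy:", numpy.__version__)             # expect 1.19.5
print("h5py:", h5py.__version__)               # expect 3.1.0
print("CUDA available:", torch.cuda.is_available())
```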
Generate the question dictionary:
cd tools
python create_dictionary.py
Generate answer labels:
python create_ans_label.py
Generate the image patch features from the icon classifier model that is pre-trained on our proposed Icon645 dataset:
python generate_img_patch_feature.py --icon_pretrained True --patch_split 79
- --icon_pretrained True: the backbone network is pre-trained on icon data
- --patch_split 79: the image is hierarchically parsed into 79 patches before feature extraction
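For intuition about the 79-patch pyramid, the sketch below crops an image into multi-scale grids. The specific grid sizes (1x1 + 2x2 + 3x3 + 4x4 + 7x7 = 79 patches) are an assumption chosen only because they sum to 79; generate_img_patch_feature.py may use a different layout.

```python
# Illustrative multi-scale patch parsing (an assumed layout, not necessarily
# the one used by generate_img_patch_feature.py): 1x1 + 2x2 + 3x3 + 4x4 + 7x7
# grids give 1 + 4 + 9 + 16 + 49 = 79 patches.
from PIL import Image

def parse_pyramid_patches(img, grids=(1, 2, 3, 4, 7), patch_size=224):
    w, h = img.size
    patches = []
    for n in grids:                       # one grid per pyramid level
        for row in range(n):
            for col in range(n):
                box = (int(col * w / n), int(row * h / n),
                       int((col + 1) * w / n), int((row + 1) * h / n))
                patches.append(img.crop(box).resize((patch_size, patch_size)))
    return patches                        # 79 PIL images, each 224x224

patches = parse_pyramid_patches(Image.new("RGB", (448, 448)))
print(len(patches))  # 79
```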
Generate the image choice features for the multi-image-choice sub-task from the icon classifier model that is pre-trained on our proposed Icon645 dataset:
python generate_img_choice_feature.py --icon_pretrained True
- --icon_pretrained True: the backbone network is pre-trained on icon data
Optionally, you can set --icon_pretrained False to generate image features from a ResNet101 model pre-trained on the natural image dataset ImageNet.
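If you go the ImageNet route, the pool5 feature is the output of ResNet101 right before its final fully connected layer. The snippet below is a minimal torchvision sketch of extracting such a feature; it is not the repository's extraction script, and the preprocessing shown is the standard ImageNet recipe rather than a confirmed setting.

```python
# Minimal sketch of extracting ResNet101 "pool5" features with torchvision
# (the repository's own extraction scripts may differ in preprocessing).
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet101(pretrained=True)            # ImageNet weights
pool5 = nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer
pool5.eval()

transform = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                       T.Normalize(mean=[0.485, 0.456, 0.406],
                                   std=[0.229, 0.224, 0.225])])

img = Image.new("RGB", (224, 224))                    # stand-in for a real diagram
with torch.no_grad():
    feat = pool5(transform(img).unsqueeze(0)).flatten(1)
print(feat.shape)                                     # torch.Size([1, 2048])
```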
The above steps are time-consuming and can take several hours. Alternatively, you can download the extracted features here or from Google Drive, then unzip them into root_dir/data. Or run the following command:
. tools/download_img_feats.sh
Before moving on, please check the following directories:
data/
├── dictionary.pkl
├── iconqa_data
│ └── iconqa
│ ├── test
│ ├── train
│ └── val
├── img_choice_embeddings
│ └── resnet101_pool5_icon
│ ├── iconqa_test_choose_img_resnet101_pool5_icon.pth
│ ├── iconqa_train_choose_img_resnet101_pool5_icon.pth
│ └── iconqa_val_choose_img_resnet101_pool5_icon.pth
├── patch_embeddings
│ └── resnet101_pool5_79_icon
│ ├── iconqa_test_choose_img_resnet101_pool5_79_icon.pth
│ ├── iconqa_test_choose_txt_resnet101_pool5_79_icon.pth
│ ├── iconqa_test_fill_in_blank_resnet101_pool5_79_icon.pth
│ ├── iconqa_train_choose_img_resnet101_pool5_79_icon.pth
│ ├── iconqa_train_choose_txt_resnet101_pool5_79_icon.pth
│ ├── iconqa_train_fill_in_blank_resnet101_pool5_79_icon.pth
│ ├── iconqa_val_choose_img_resnet101_pool5_79_icon.pth
│ ├── iconqa_val_choose_txt_resnet101_pool5_79_icon.pth
│ └── iconqa_val_fill_in_blank_resnet101_pool5_79_icon.pth
├── pid_splits.json
├── problems.json
├── trainval_choose_img_ans2label.pkl
├── trainval_choose_img_label2ans.pkl
├── trainval_choose_txt_ans2label.pkl
├── trainval_choose_txt_label2ans.pkl
├── trainval_fill_in_blank_ans2label.pkl
└── trainval_fill_in_blank_label2ans.pkl
saved_models/
├── choose_img
│ └── exp_paper
│ └── best_model.pth
├── choose_txt
│ └── exp_paper
│ └── best_paper.pth
├── fill_in_blank
│ └── exp_paper
│ └── best_paper.pth
└── icon_classification_ckpt
└── icon_resnet101_LDAM_DRW_lr0.01_0
└── ckpt.epoch66_best.pth.tar
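Optionally, the short script below (not part of the repository) verifies that the key files listed above are in place before training; run it from root_dir.

```python
# Optional check (not part of the repository) that the expected files exist.
import os

expected = [
    "data/dictionary.pkl",
    "data/problems.json",
    "data/pid_splits.json",
    "data/trainval_fill_in_blank_ans2label.pkl",
    "data/patch_embeddings/resnet101_pool5_79_icon",
    "data/img_choice_embeddings/resnet101_pool5_icon",
    "saved_models/icon_classification_ckpt",
]

for path in expected:
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"[{status}] {path}")
```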
Train the Patch-TRM model for the filling-in-the-blank sub-task:
cd run_fill_in_blank
python train.py --model patch_transformer_ques_bert --label exp0
Evaluate the Patch-TRM model for the filling-in-the-blank sub-task:
python eval.py --model patch_transformer_ques_bert --label exp0
Or, you can evaluate the Patch-TRM model for the filling-in-the-blank sub-task with our trained model:
python eval.py --model patch_transformer_ques_bert --label exp_paper
Train the Patch-TRM model for the multi-text-choice sub-task:
cd run_choose_txt
python train.py --model patch_transformer_ques_bert --label exp0
Evaluate the Patch-TRM model for the multi-text-choice sub-task:
python eval.py --model patch_transformer_ques_bert --label exp0
Or, you can evaluate the Patch-TRM model for the multi-text-choice sub-task with our trained model:
python eval.py --model patch_transformer_ques_bert --label exp_paper
Train the Patch-TRM model for the multi-image-choice sub-task:
cd run_choose_img
python train.py --model patch_transformer_ques_bert --label exp0
Evaluate the Patch-TRM model for the multi-image-choice sub-task:
python eval.py --model patch_transformer_ques_bert --label exp0
Or, you can evaluate the Patch-TRM model for the multi-image-choice sub-task with our trained model:
python eval.py --model patch_transformer_ques_bert --label exp_paper
Calculate the accuracies over different skills based on the result json files of the models reported in the paper:
cd tools
python sub_acc.py \
--fill_in_blank_result exp_patch_transformer_ques_bert.json \
--choose_txt_result exp_patch_transformer_ques_bert.json \
--choose_img_result exp_patch_transformer_ques_bert.json
Calculate the accuracies over different skills based on user-specified result json files:
python sub_acc.py \
--fill_in_blank_result exp0_patch_transformer_ques_bert.json \
--choose_txt_result exp0_patch_transformer_ques_bert.json \
--choose_img_result exp0_patch_transformer_ques_bert.json
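Conceptually, per-skill accuracy is just grouping questions by their annotated skill and averaging correctness. The stand-alone sketch below illustrates this; the assumed schemas (a result json mapping question ids to predicted answers, and a "skills" list plus "answer" field per entry in problems.json) are hypothetical and may not match what tools/sub_acc.py actually expects.

```python
# Hypothetical stand-alone version of per-skill accuracy; the field names
# ("skills", "answer", qid -> prediction mapping) are assumptions and may not
# match the actual schemas used by tools/sub_acc.py.
import json
from collections import defaultdict

def skill_accuracy(result_file, problems_file="../data/problems.json"):
    results = json.load(open(result_file))      # assumed: {qid: predicted_answer}
    problems = json.load(open(problems_file))   # assumed: {qid: {"answer": ..., "skills": [...]}}
    correct, total = defaultdict(int), defaultdict(int)
    for qid, pred in results.items():
        for skill in problems[qid].get("skills", []):
            total[skill] += 1
            correct[skill] += int(pred == problems[qid]["answer"])
    return {skill: correct[skill] / total[skill] for skill in total}
```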
In addition to IconQA, we also present Icon645, a large-scale dataset of icons that cover a wide range of objects:
- 645,687 colored icons
- 377 different icon classes (class mapping is stored in icon645_classes.json)
These collected icon classes are frequently mentioned in the IconQA questions. In this work, we use the icon data to pre-train backbone networks on the icon classification task in order to extract semantic representations from abstract diagrams in IconQA. On top of pre-training encoders, the large-scale icon data could also contribute to open research on abstract aesthetics and symbolic visual understanding.
You can download Icon645 here or from Google Drive. Or run the following commands:
cd data
wget https://iconqa2021.s3.us-west-1.amazonaws.com/icon645.zip
unzip icon645.zip
File structure of the Icon645 dataset:
icon645
| LICENCE.md
| metadata.json
└───colored_icons_final
|
└───acorn
| | image_id1.png
| | image_id2.png
| | ...
|
└───airplane
| | image_id3.png
| | ...
|
| ...
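Since colored_icons_final/ uses a one-folder-per-class layout, it can be loaded directly with torchvision's ImageFolder for classifier pre-training. A minimal loading sketch follows, assuming the archive was unzipped into data/ as above; the transform and batch size are illustrative choices, not the paper's training settings.

```python
# Minimal sketch for loading Icon645 with torchvision's ImageFolder
# (transform and batch size are illustrative, not the paper's settings).
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])
dataset = ImageFolder("data/icon645/colored_icons_final", transform=transform)
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

print(len(dataset), "icons across", len(dataset.classes), "classes")  # ~645k icons, 377 classes
```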
If the paper or the dataset inspires you, please cite us:
@inproceedings{lu2021iconqa,
title = {IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning},
author = {Lu, Pan and Qiu, Liang and Chen, Jiaqi and Xia, Tony and Zhao, Yizhou and Zhang, Wei and Yu, Zhou and Liang, Xiaodan and Zhu, Song-Chun},
booktitle = {The 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks},
year = {2021}
}
Our dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.