A PyTorch implementation of the CVPR 2020 paper: Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text

Multi-Modal GNN for TextVQA

  1. This project provides code to reproduce the results of Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text on the TextVQA dataset
  2. We are grateful to MMF (formerly Pythia), an excellent VQA codebase provided by Facebook, on which our code is built
  3. We achieved 32.46% accuracy (ensemble) on the test set of TextVQA

Requirements

  1. PyTorch 1.0.1 (a post-release build)
  2. We performed our experiments on a Maxwell Titan X GPU, which has 12GB of GPU memory
  3. See requirements.txt for the required Python packages and run the commands below to install them

Let's begin by cloning this repository:

$ git clone https://github.com/ricolike/mmgnn-textvqa.git
$ cd mmgnn-textvqa
$ pip install -r requirements.txt

Data Setup

  1. cached data: To boost data-loading speed under a limited memory budget (64G) and to speed up computation, we cache intermediate dataloader results on disk. Download the data (around 54G, around 120G unzipped), and modify line 11 (fast_dir) in the config to the absolute path where you saved it
  2. other files: Download the other needed files (vocabulary, OCRs, some backbone parameters) here (less than 1G), and make a soft link named data under the repo root pointing to where you saved them; a sketch of both steps follows this list
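
For reference, a minimal sketch of both steps, assuming the config referred to above is configs/vqa/textvqa/s_mmgnn.yml; /path/to/other_files is a placeholder for wherever you saved the auxiliary files:

$ # locate fast_dir in the config, then edit it to the absolute path of the cached data
$ grep -n "fast_dir" configs/vqa/textvqa/s_mmgnn.yml
$ # link the auxiliary files (vocabulary, OCRs, backbone parameters) as ./data under the repo root
$ ln -s /path/to/other_files data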

Training

  1. Create a new model folder under ensemble, say foo, and copy our config into it:
$ mkdir -p ensemble/foo
$ cp ./configs/vqa/textvqa/s_mmgnn.yml ./ensemble/foo
  2. Start training; parameters will be saved in ensemble/foo:
$ python tools/run.py --tasks vqa --datasets textvqa --model s_mmgnn --config ensemble/foo/s_mmgnn.yml -dev cuda:0 --run_type train
  3. The first run of this repo will automatically download GloVe into pythia/.vector_cache, so be patient. If training succeeds, you will find s_mmgnnbar_final.pth in the model folder ensemble/foo

Inference

  1. If you want to skip the training procedure, a trained model is provided, on which you can directly run inference
  2. Start inference by running the command below. If it succeeds, you will find three new files generated under the model folder; the two ending with _evailai.p are ready to be submitted to EvalAI to check the results (a sketch for inspecting them locally follows the command)
$ python tools/run.py --tasks vqa --datasets textvqa --model s_mmgnn --config ensemble/bar/s_mmgnn.yml --resume_file <path_to_pth> -dev cuda:0 --run_type all_in_one
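
If you want to sanity-check the predictions locally before uploading, the sketch below assumes the *_evailai.p files are ordinary Python pickles holding a list of per-question prediction records (the filename is a placeholder); adjust it if the actual format differs:

$ python - <<'EOF'
import pickle
# Assumption: the prediction file is a plain pickle of per-question records.
with open("ensemble/bar/<prediction_file>_evailai.p", "rb") as f:  # placeholder name
    preds = pickle.load(f)
print(type(preds), len(preds))  # overall structure and size
print(preds[0])                 # inspect one record
EOF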

Bibtex

@article{gao2020multi,
  title={Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text},
  author={Gao, Difei and Li, Ke and Wang, Ruiping and Shan, Shiguang and Chen, Xilin},
  journal={arXiv preprint arXiv:2003.13962},
  year={2020}
}

An attention visualization


Question: "What is the name of the bread sold at the place?"
Answer: "Panera"
(the white box is the predicted answer, green boxes are the OCR tokens "Panera" attends to, and red boxes are the visual ROIs "Panera" attends to; box line weight indicates attention strength)