Awesome Text VQA

Text-related VQA is a fine-grained subtask of visual question answering (VQA) that focuses only on questions whose answers require reading the text shown in the input image.

Datasets

| Dataset | #Train+Val Img | #Train+Val Que | #Test Img | #Test Que | Image Source | Language |
|---|---|---|---|---|---|---|
| Text-VQA | 25,119 | 39,602 | 3,353 | 5,734 | [1] | EN |
| ST-VQA | 19,027 | 26,308 | 2,993 | 4,163 | [2, 3, 4, 5, 6, 7, 8] | EN |
| OCR-VQA | 186,775 | 901,717 | 20,797 | 100,429 | [9] | EN |
| EST-VQA | 20,757 | 23,062 | 4,482 | 5,000 | [4, 5, 8, 10, 11, 12, 13] | EN+CH |
| DOC-VQA | 11,480 | 44,812 | 1,287 | 5,188 | [14] | EN+CH |
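
The split sizes above can be turned into a rough questions-per-image density, which helps when comparing how heavily each dataset annotates its images. The snippet below is a minimal sketch: the numbers are copied from the table, while the names `DATASETS`, `questions_per_image`, and `summarize` are illustrative and not part of any dataset toolkit.

```python
# Split sizes copied from the dataset table:
# (train+val images, train+val questions, test images, test questions)
DATASETS = {
    "Text-VQA": (25_119, 39_602, 3_353, 5_734),
    "ST-VQA": (19_027, 26_308, 2_993, 4_163),
    "OCR-VQA": (186_775, 901_717, 20_797, 100_429),
    "EST-VQA": (20_757, 23_062, 4_482, 5_000),
    "DOC-VQA": (11_480, 44_812, 1_287, 5_188),
}


def questions_per_image(images: int, questions: int) -> float:
    """Average number of questions asked per image in one split."""
    return questions / images


def summarize(stats: dict) -> dict:
    """Map each dataset name to (train+val density, test density)."""
    rows = {}
    for name, (tv_img, tv_que, te_img, te_que) in stats.items():
        rows[name] = (
            questions_per_image(tv_img, tv_que),
            questions_per_image(te_img, te_que),
        )
    return rows


if __name__ == "__main__":
    for name, (trainval, test) in summarize(DATASETS).items():
        print(f"{name:<9} train+val: {trainval:.2f} q/img   test: {test:.2f} q/img")
```

Running the script prints one line per dataset; the density is only a coarse indicator and says nothing about question difficulty or answer length.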

Image Source:
[1] OpenImages: A public dataset for large-scale multi-label and multi-class image classification (v3) [dataset]
[2] ImageNet: A large-scale hierarchical image database [dataset]
[3] VizWiz grand challenge: Answering visual questions from blind people [dataset]
[4] ICDAR 2013 robust reading competition [dataset]
[5] ICDAR 2015 competition on robust reading [dataset]
[6] Visual Genome: Connecting language and vision using crowdsourced dense image annotations [dataset]
[7] Image retrieval using textual cues [dataset]
[8] COCO-Text: Dataset and benchmark for text detection and recognition in natural images [dataset]
[9] Judging a book by its cover [dataset]
[10] Total Text [dataset]
[11] SCUT-CTW1500 [dataset]
[12] MLT [dataset]
[13] Chinese Street View Text [dataset]
[14] UCSF Industry Document Library [dataset]

Related Challenges

Document Visual Question Answering (CVPR 2020 Workshop on Text and Documents in the Deep Learning Era). Submission deadline: 30 April 2020. [Challenge]

Papers

2020

  • [SA-M4C] Spatially Aware Multimodal Transformers for TextVQA (ECCV) [Paper]
  • [EST-VQA] On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering (CVPR) [Paper]
  • [M4C] Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA (CVPR) [Paper][Project]
  • [LaAP-Net] Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering (COLING) [Paper]

2019

  • [Text-VQA/LoRRA] Towards VQA Models That Can Read (CVPR) [Paper][Code]
  • [ST-VQA] Scene Text Visual Question Answering (ICCV) [Paper]
  • [Text-KVQA] From Strings to Things: Knowledge-enabled VQA Model that can Read and Reason (ICCV) [Paper]
  • [OCR-VQA] OCR-VQA: Visual Question Answering by Reading Text in Images (ICDAR) [Paper]

Technical Reports

  • [SMA] Structured Multimodal Attentions for TextVQA [Report][Slides][Video]
  • [DiagNet] DiagNet: Bridging Text and Image [Report][Code]
  • [DCD_ZJU] Winner of 2019 Text-VQA challenge [Slides]
  • [Schwail] Runner-up of 2019 Text-VQA challenge [Slides]

Benchmark

Y-C./J.: Year and Conference/Journal
Acc.: Accuracy
I. E.: Image Encoder
Q. E.: Question Encoder
O. E.: OCR Token Encoder
Ensem.: Ensemble

Text-VQA

[official leaderboard(2019)] [official leaderboard(2020)]

| Y-C./J. | Methods | Acc. | I. E. | Q. E. | OCR | O. E. | Output | Ensem. |
|---|---|---|---|---|---|---|---|---|
| 2019--CVPR | LoRRA | 26.64 | Faster R-CNN | GloVe | Rosetta-ml | FastText | Classification | N |
| 2019--N/A | DCD_ZJU | 31.44 | Faster R-CNN | BERT | Rosetta-ml | FastText | Classification | Y |
| 2020--CVPR | M4C | 40.46 | Faster R-CNN (ResNet-101) | BERT | Rosetta-en | FastText | Decoder | N |
| 2020--Challenge | Xiangpeng | 40.77 | | | | | | |
| 2020--Challenge | colab_buaa | 44.73 | | | | | | |
| 2020--Challenge | CVMLP(SAM) | 44.80 | | | | | | |
| 2020--Challenge | NWPU_Adelaide_Team(SMA) | 45.51 | Faster R-CNN | BERT | BDN | Graph Attention | Decoder | N |
| 2020--ECCV | SA-M4C | 44.6* | Faster R-CNN (ResNeXt-152) | BERT | Google-OCR | FastText+PHOC | Decoder | N |

* Using external data for training.

ST-VQA

[official leaderboard]
T1: Strongly Contextualised Task
T2: Weakly Contextualised Task
T3: Open Dictionary

| Y-C./J. | Methods | Acc. (T1/T2/T3) | I. E. | Q. E. | OCR | O. E. | Output | Ensem. |
|---|---|---|---|---|---|---|---|---|
| 2020--CVPR | M4C | na/na/0.4621 | Faster R-CNN (ResNet-101) | BERT | Rosetta-en | FastText | Decoder | N |
| 2020--Challenge | SMA | 0.5081/0.3104/0.4659 | Faster R-CNN | BERT | BDN | Graph Attention | Decoder | N |
| 2020--ECCV | SA-M4C | na/na/0.5042 | Faster R-CNN (ResNeXt-152) | BERT | Google-OCR | FastText+PHOC | Decoder | N |

OCR-VQA

| Y-C./J. | Methods | Acc. | I. E. | Q. E. | OCR | O. E. | Output | Ensem. |
|---|---|---|---|---|---|---|---|---|
| 2020--CVPR | M4C | 63.9 | Faster R-CNN | BERT | Rosetta-en | FastText | Decoder | N |