Text-related VQA is a fine-grained direction of the VQA task that focuses only on questions whose answers require reading the text shown in the input image.
- EST-VQA dataset (CVPR 2020) [Project][Paper]
- DOC-VQA dataset (CVPR Workshop 2020) [Project][Paper]
- Text-VQA dataset (CVPR 2019) [Project][Paper]
- ST-VQA dataset (ICCV 2019) [Project][Paper]
- OCR-VQA dataset (ICDAR 2019) [Project][Paper]
Dataset | #Train+Val Img | #Train+Val Que | #Test Img | #Test Que | Image Source | Language |
---|---|---|---|---|---|---|
Text-VQA | 25,119 | 39,602 | 3,353 | 5,734 | [1] | EN |
ST-VQA | 19,027 | 26,308 | 2,993 | 4,163 | [2, 3, 4, 5, 6, 7, 8] | EN |
OCR-VQA | 186,775 | 901,717 | 20,797 | 100,429 | [9] | EN |
EST-VQA | 20,757 | 23,062 | 4,482 | 5,000 | [4, 5, 8, 10, 11, 12, 13] | EN+CH |
DOC-VQA | 11,480 | 44,812 | 1,287 | 5,188 | [14] | EN+CH |
Image Source:
[1] OpenImages: A public dataset for large-scale multi-label and multi-class image classification (v3) [dataset]
[2] Imagenet: A large-scale hierarchical image database [dataset]
[3] Vizwiz grand challenge: Answering visual questions from blind people [dataset]
[4] ICDAR 2013 robust reading competition [dataset]
[5] ICDAR 2015 competition on robust reading [dataset]
[6] Visual Genome: Connecting language and vision using crowdsourced dense image annotations [dataset]
[7] Image retrieval using textual cues [dataset]
[8] Coco-text: Dataset and benchmark for text detection and recognition in natural images [dataset]
[9] Judging a book by its cover [dataset]
[10] Total Text [dataset]
[11] SCUT-CTW1500 [dataset]
[12] MLT [dataset]
[13] Chinese Street View Text [dataset]
[14] UCSF Industry Document Library [dataset]
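To compare dataset scale at a glance, the statistics in the table above can be combined into total image and question counts and a questions-per-image ratio. The following is a minimal sketch: the numbers are copied from the table, while the names `STATS` and `summarize` are hypothetical helpers introduced here for illustration and are not part of any dataset toolkit.

```python
# (images, questions) per split, copied from the dataset table above.
STATS = {
    "Text-VQA": {"train_val": (25_119, 39_602), "test": (3_353, 5_734)},
    "ST-VQA": {"train_val": (19_027, 26_308), "test": (2_993, 4_163)},
    "OCR-VQA": {"train_val": (186_775, 901_717), "test": (20_797, 100_429)},
    "EST-VQA": {"train_val": (20_757, 23_062), "test": (4_482, 5_000)},
    "DOC-VQA": {"train_val": (11_480, 44_812), "test": (1_287, 5_188)},
}


def summarize(stats):
    """Return {dataset: (total_images, total_questions, questions_per_image)}."""
    rows = {}
    for name, splits in stats.items():
        total_img = sum(img for img, _ in splits.values())
        total_que = sum(que for _, que in splits.values())
        # Ratio is rounded to two decimals for display only.
        rows[name] = (total_img, total_que, round(total_que / total_img, 2))
    return rows


if __name__ == "__main__":
    for name, (img, que, ratio) in summarize(STATS).items():
        print(f"{name:10s} {img:>8,d} {que:>10,d} {ratio:>6.2f}")
```

The ratio gives a rough sense of how densely each image is annotated; it says nothing about question difficulty or answer type.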
Document Visual Question Answering (CVPR 2020 Workshop on Text and Documents in the Deep Learning Era). Submission deadline: 30 April 2020 [Challenge]
- [SA-M4C] Spatially Aware Multimodal Transformers for TextVQA (ECCV) [Paper]
- [EST-VQA] On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering (CVPR) [Paper]
- [M4C] Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA (CVPR) [Paper][Project]
- [LaAP-Net] Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering (COLING) [Paper]
- [Text-VQA/LoRRA] Towards VQA Models That Can Read (CVPR) [Paper][Code]
- [ST-VQA] Scene Text Visual Question Answering (ICCV) [Paper]
- [Text-KVQA] From Strings to Things: Knowledge-enabled VQA Model that can Read and Reason (ICCV) [Paper]
- [OCR-VQA] OCR-VQA: Visual Question Answering by Reading Text in Images (ICDAR) [Paper]
- [SMA] Structured Multimodal Attentions for TextVQA [Report][Slides][Video]
- [DiagNet] DiagNet: Bridging Text and Image [Report][Code]
- [DCD_ZJU] Winner of 2019 Text-VQA challenge [Slides]
- [Schwail] Runner-up of 2019 Text-VQA challenge [Slides]
Y-C./J. : Year-Conference/Journal
Acc. : Accuracy
I. E. : Image Encoder
Q. E. : Question Encoder
O. E. : OCR Token Encoder
Ensem. : Ensemble
[official leaderboard(2019)] [official leaderboard(2020)]
Y-C./J. | Methods | Acc. | I. E. | Q. E. | OCR | O. E. | Output | Ensem. |
---|---|---|---|---|---|---|---|---|
2019--CVPR | LoRRA | 26.64 | Faster R-CNN | GloVe | Rosetta-ml | FastText | Classification | N |
2019--N/A | DCD_ZJU | 31.44 | Faster R-CNN | BERT | Rosetta-ml | FastText | Classification | Y |
2020--CVPR | M4C | 40.46 | Faster R-CNN (ResNet-101) | BERT | Rosetta-en | FastText | Decoder | N |
2020--Challenge | Xiangpeng | 40.77 | ||||||
2020--Challenge | colab_buaa | 44.73 | ||||||
2020--Challenge | CVMLP(SAM) | 44.80 | ||||||
2020--Challenge | NWPU_Adelaide_Team(SMA) | 45.51 | Faster R-CNN | BERT | BDN | Graph Attention | Decoder | N |
2020--ECCV | SA-M4C | 44.6* | Faster R-CNN (ResNeXt-152) | BERT | Google-OCR | FastText+PHOC | Decoder | N |
* Using external data for training.
[official leaderboard]
T1 : Strongly Contextualised Task
T2 : Weakly Contextualised Task
T3 : Open Dictionary
Y-C./J. | Methods | Acc. (T1/T2/T3) | I. E. | Q. E. | OCR | O. E. | Output | Ensem. |
---|---|---|---|---|---|---|---|---|
2020--CVPR | M4C | na/na/0.4621 | Faster R-CNN (ResNet-101) | BERT | Rosetta-en | FastText | Decoder | N |
2020--Challenge | SMA | 0.5081/0.3104/0.4659 | Faster R-CNN | BERT | BDN | Graph Attention | Decoder | N |
2020--ECCV | SA-M4C | na/na/0.5042 | Faster R-CNN (ResNeXt-152) | BERT | Google-OCR | FastText+PHOC | Decoder | N |
Y-C./J. | Methods | Acc. | I. E. | Q. E. | OCR | O. E. | Output | Ensem. |
---|---|---|---|---|---|---|---|---|
2020--CVPR | M4C | 63.9 | Faster R-CNN | BERT | Rosetta-en | FastText | Decoder | N |