1. Datasets

1.1 Introduction

  • SVT [16]:

    • Introduction: The dataset contains 100 training images and 250 testing images of road-side scenes, downloaded from Google Street View. The labelled text can be very challenging, with a wide variety of fonts, orientations, and lighting conditions. A 50-word lexicon (SVT-50) is also provided for each image.
    • Link: SVT-download
  • ICDAR 2003 (IC03) [17]:

    • Introduction: The dataset contains a varied array of real-world photos that include scene text. There are 251 testing images, each with a 50-word lexicon (IC03-50), plus a lexicon of all ground-truth words in the test set (IC03-Full).
    • Link: IC03-download
  • ICDAR 2011 (IC11) [18]:

    • Introduction: This dataset extends the one used for the text locating competitions of ICDAR 2003. It includes 485 natural images in total.
    • Link: IC11-download
  • ICDAR 2013 (IC13) [19]:

    • Introduction: The dataset consists of 229 training images and 233 testing images. Most of the text is horizontal. Three lexicons are provided: “Strong” (S), “Weak” (W), and “Generic” (G). The Strong lexicon provides 100 words per image, including all words that appear in that image. The Weak lexicon includes all words that appear in the entire test set. The Generic lexicon is a 90k-word vocabulary.
    • Link: IC13-download
  • ICDAR 2015 (IC15) [20]:

    • Introduction: The dataset includes 1000 training images and 500 testing images captured with Google Glass. The text in the scenes appears in arbitrary orientations. Like ICDAR 2013, it provides “Strong” (S), “Weak” (W), and “Generic” (G) lexicons.
    • Link: IC15-download
  • Total-Text [21]:

    • Introduction: In addition to horizontal and oriented text, Total-Text contains a large amount of curved text. It has 1255 training images and 300 test images. All images are annotated with word-level polygons and transcriptions. A “Full” lexicon containing all words in the test set is provided.
    • Link: Total-Text-download
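
The lexicons described above (SVT-50, IC03-50, and the Strong/Weak/Generic lists) are commonly used to constrain a recognizer's raw output to a known word list. The sketch below shows one simple way this can be done: snapping a prediction to the lexicon word with the smallest edit distance. It is an illustration only, not the protocol of any particular benchmark or paper; the function names and the sample lexicon are hypothetical.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def snap_to_lexicon(prediction: str, lexicon: Sequence[str]) -> str:
    """Return the lexicon word closest to the raw prediction (case-insensitive)."""
    pred = prediction.lower()
    return min(lexicon, key=lambda word: edit_distance(pred, word.lower()))


# Hypothetical per-image lexicon, standing in for an SVT-50 or "Strong" list.
lexicon = ["hotel", "pizza", "market", "parking"]
print(snap_to_lexicon("p1zza", lexicon))  # pizza
```

A small per-image lexicon makes this correction far more forgiving than a 90k-word generic vocabulary, which is one reason results are reported separately for each lexicon type.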

1.2 Comparison of Datasets

Datasets    Language  Images                Text instances
                      Total  Train  Test    Total  Train  Test
IC03        English   509    258    251     2266   1110   1156
IC11        English   484    229    255     1564
IC13        English   462    229    233     1944   849    1095
SVT         English   350    100    250     725    211    514
SVT-P       English   238                   639
IC15        English   1500   1000   500     17548  12318  5230
Total-Text  English   1555   1255   300     9330

The Text Shape (Horizontal, Arbitrary-Quadrilateral, Multi-oriented), Annotation level (Char, Word, Text-Line), and Lexicon (50, 1k, Full, None) columns are omitted here because their per-dataset marks are not available.

2. Summary of End-to-end Scene Text Detection and Recognition Methods

2.1 Comparison of methods

Method                Model             Detection                         Recognition                                        Source  Time  Highlight
Wang et al. [1]                         Sliding windows and Random Ferns  Pictorial Structures                               ICCV    2011  Word Re-scoring for NMS
Wang et al. [2]                         CNN-based                         Sliding windows for classification                 ICPR    2012  CNN architecture
Jaderberg et al. [3]                    CNN-based and saliency maps       CNN classifier                                     ECCV    2014  Data mining and annotation
Alsharif et al. [4]                     CNN and hybrid HMM maxout models  Segmentation-based                                 ICLR    2014  Hybrid HMM maxout models
Yao et al. [5]                          Random Forest                     Component Linking and Word Partition               TIP     2014  (1) Detection and recognition features sharing. (2) Oriented-text. (3) A new dictionary search method
Neumann et al. [6]                      Extremal Regions                  Clustering algorithm to group characters           TPAMI   2015  Real-time performance (1.6s/image)
Jaderberg et al. [7]                    Region proposal mechanism         Word-level classification                          IJCV    2016  Trained only on data produced by a synthetic text generation engine, requiring no human labelled data
Liao et al. [8]       TextBoxes         SSD-based framework               CRNN                                               AAAI    2017  An end-to-end trainable fast scene text detector
Bušta et al. [9]      Deep TextSpotter  Yolo v2                           CTC                                                ICCV    2017  Yolov2 + RPN, RNN + CTC. It is the first end-to-end trainable detection and recognition system with high speed.
Li et al. [10]                          Text Proposal Network             Attention                                          ICCV    2017  TPN + RNN encoder + attention-based RNN
Lyu et al. [11]       Mask TextSpotter  Fast R-CNN with mask branch       Character segmentation                             ECCV    2018  Precise text detection and recognition are acquired via semantic segmentation
He et al. [12]                          Text-Alignment Layer              Attention                                          CVPR    2018  Character attention mechanism: use character spatial information as explicit supervision
Liu et al. [13]       FOTS              EAST with RoIRotate               CTC                                                CVPR    2018  Little computation overhead compared to baseline text detection network (22.6fps)
Liao et al. [14]      TextBoxes++       SSD-based framework               CRNN                                               TIP     2018  Journal version of TextBoxes (multi-oriented scene text support)
Liao et al. [15]      Mask TextSpotter  Mask RCNN                         Character segmentation + Spatial Attention Module  TPAMI   2019  Journal version of Mask TextSpotter (proposes Spatial Attention Module)
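
Deep TextSpotter [9] and FOTS [13] are listed above with CTC-based recognition. As a rough illustration of what a CTC output layer implies at inference time, the sketch below shows greedy (best-path) CTC decoding: take the most likely label per frame, collapse consecutive repeats, then drop the blank symbol. This is the generic textbook procedure, not code from either paper; the alphabet, the blank index, and the function name are assumptions made for the example.

```python
from itertools import groupby
from typing import Sequence

BLANK = 0  # assumed index of the CTC blank symbol
ALPHABET = "-abcdefghijklmnopqrstuvwxyz"  # index 0 is a placeholder for the blank


def ctc_greedy_decode(frame_labels: Sequence[int], alphabet: str = ALPHABET) -> str:
    """Collapse consecutive repeated labels, then remove blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(alphabet[label] for label in collapsed if label != BLANK)


# Per-frame argmax labels for a hypothetical 10-frame feature sequence.
frames = [0, 8, 8, 0, 5, 12, 12, 0, 12, 15]
print(ctc_greedy_decode(frames))  # hello
```

The blank between the two runs of label 12 is what lets the decoder emit a double "l"; without it the repeats would collapse into a single character.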

2.2 End-to-end scene text detection and recognition results

Method                Model             Source  Time  SVT    SVT-50 IC03                 IC11   IC13 End-to-end      IC13 Spotting        IC15 End-to-end      IC15 Spotting        Total-Text
                                                                    50     Full   None          S      W      G      S      W      G      S      W      G      S      W      G      None   Full
Wang et al. [1]                         ICCV    2011  ~      ~      51            ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~
Wang et al. [2]                         ICPR    2012  46     ~      72     67     ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~
Jaderberg et al. [3]                    ECCV    2014  ~      56     80     75     ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~
Alsharif et al. [4]                     ICLR    2014  ~      48     77     70     ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~
Yao et al. [5]                          TIP     2014  ~      ~      ~      ~                           ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~
Neumann et al. [6]                      TPAMI   2015         68.1   ~      ~      ~      ~      45.2   ~      ~      ~      ~      ~      35     19.9   15.6   35     19.9   15.6   ~      ~
Jaderberg et al. [7]                    IJCV    2016  53     76     90     86     78     76     76     ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~      ~
Liao et al. [8]       TextBoxes         AAAI    2017  64     84     ~      ~      ~      87     91     89     84     94     92     87     ~      ~      ~      ~      ~      ~      36.3   48.9
Bušta et al. [9]      Deep TextSpotter  ICCV    2017  ~      ~      ~      ~      ~      ~      89     86     77     92     89     81     54     51     47     58     53     51     ~      ~
Li et al. [10]                          ICCV    2017  66.18  84.91  ~      ~      ~      87.7   91.08  89.8   84.6   94.2   92.4   88.2   ~      ~      ~      ~      ~      ~      ~      ~
Lyu et al. [11]       Mask TextSpotter  ECCV    2018  ~      ~      ~      ~      ~      ~      92.2   91.1   86.5   92.5   92     88.2   79.3   73     62.4   79.3   74.5   64.2   52.9   71.8
He et al. [12]                          CVPR    2018  ~      ~      ~      ~      ~      ~      91     89     86     93     92     87     82     77     63     85     80     65     ~      ~
Liu et al. [13]       FOTS              CVPR    2018  ~      ~      ~      ~      ~      ~      91.99  90.11  84.77  95.94  93.9   87.76  83.55  79.11  65.33  87.01  82.39  67.97  ~      ~
Liao et al. [14]      TextBoxes++       TIP     2018  64     84     ~      ~      ~      ~      93     92     85     96     95     87     73.3   65.9   51.9   76.5   69     54.4   ~      ~
Liao et al. [15]      Mask TextSpotter  TPAMI   2019  ~      ~      ~      ~      ~      ~      93.3   91.3   88.2   92.7   91.7   87.7   83     77.7   73.5   82.4   78.1   73.6   65.3   77.4
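
The table does not name its metric. Assuming the entries are F-measure percentages, which is how end-to-end and word-spotting results are customarily reported on the ICDAR benchmarks, the sketch below shows how such a score is derived from match counts. The function name and the example counts are hypothetical, and the rule for deciding that a predicted word is "correct" (box overlap plus matching transcription) is left outside the sketch.

```python
def f_measure(num_correct: int, num_predicted: int, num_ground_truth: int) -> float:
    """Harmonic mean of precision and recall, returned in the range [0, 1]."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_ground_truth if num_ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 100 predicted words, 120 ground-truth words, 80 correct.
score = f_measure(num_correct=80, num_predicted=100, num_ground_truth=120)
print(round(100 * score, 1))  # 72.7
```

Under this reading, a stronger lexicon raises the number of correct matches without changing the number of ground-truth words, which is consistent with the S, W, G columns decreasing from left to right for every method.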

3. Survey

[A] [TPAMI-2015] Ye Q, Doermann D. Text detection and recognition in imagery: A survey[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(7): 1480-1500. paper

[B] [Frontiers-Comput. Sci-2016] Zhu Y, Yao C, Bai X. Scene text detection and recognition: Recent advances and future trends[J]. Frontiers of Computer Science, 2016, 10(1): 19-36. paper

[C] [arXiv-2018] Long S, He X, Yao C. Scene Text Detection and Recognition: The Deep Learning Era[J]. arXiv preprint arXiv:1811.04256, 2018. paper

4. OCR Service

5. References and codes

If you find any problems in our resources, or any good papers/codes we have missed, please inform us at liuchongyu1996@gmail.com. Thank you for your contribution.


