Author: 陈晓雪
Dec 23, 2019: Added 20 papers and the C-SVT dataset, and updated the corresponding tables. You can download the updated Excel sheet we prepared. (Password: teqv)
- IIIT5K[31]:
- Introduction: It contains 5,000 cropped word images in total, 2,000 for training and 3,000 for testing. Every image is associated with a 50-word lexicon and a 1,000-word lexicon. Each lexicon consists of the ground-truth word plus randomly picked words.
- Link: IIIT5K-download
- SVT[32]:
- Introduction: It contains 647 cropped word images collected from Google Street View. Many images are severely corrupted by noise, blur, or low resolution. Every image is associated with a 50-word lexicon. Only word-level annotations are provided.
- Link: SVT-download
- ICDAR 2003(IC03)[33]:
- Introduction: It contains 509 images in total, 258 for training and 251 for testing. After discarding images that contain non-alphanumeric characters or fewer than three characters, it yields 867 cropped word images. Every image is associated with a 50-word lexicon and a full-word lexicon; the full lexicon combines the lexicon words of all images.
- Link: IC03-download
- ICDAR 2013(IC13)[34]:
- Introduction: It contains 1,015 cropped word images and inherits most of its samples from IC03. No lexicon is associated with this dataset.
- Link: IC13-download
- COCO-Text[38]:
- Introduction: It contains 63,686 images in total, with 145,859 cropped word images for testing, covering handwritten and printed, clear and blurred, English and non-English text.
- Link: COCO-Text-download
- SVHN[45]:
- Introduction: It contains more than 600,000 digit images of house numbers in natural scenes, collected from Google Street View images and used for digit recognition.
- Link: SVHN-download
- SVT-P[35]:
- Introduction: It contains 639 cropped word images for testing, selected from side-view snapshots in Google Street View. As a result, most images are heavily distorted by the non-frontal view angle. Every image is associated with a 50-word lexicon and a full-word lexicon.
- Link: SVT-P-download (Password : vnis)
- CUTE80[36]:
- Introduction: It contains 80 high-resolution images taken in natural scenes. Specifically, it contains 288 cropped word images for testing. The dataset focuses on curved text. No lexicon is provided.
- Link: CUTE80-download
- ICDAR 2015(IC15)[37]:
- Introduction: It contains 1,500 images in total, 1,000 for training and 500 for testing. Specifically, it contains 2,077 cropped word images, including more than 200 irregular text samples. No lexicon is associated with this dataset.
- Link: IC15-download
- Total-Text[39]:
- Introduction: It contains 1,555 images in total, with 11,459 cropped word images covering three different text orientations: horizontal, multi-oriented, and curved.
- Link: Total-Text-download
- RCTW-17(RCTW competition, ICDAR17)[40]:
- Introduction: It contains 12,514 images in total, 11,514 for training and 1,000 for testing. Images in RCTW-17 were mostly collected by camera or mobile phone; the remainder are generated images. Text instances are annotated with parallelograms. It was the first large-scale Chinese dataset, and the largest published one at the time.
- Link: RCTW-17-download
- MTWI(competition)[41]:
- Introduction: It contains 20,000 images, mainly of Chinese and English web text. The competition comprises three tasks: web text recognition, web text detection, and end-to-end web text detection and recognition.
- Link: MTWI-download (Password:gox9)
- CTW[42]:
- Introduction: It contains 32,285 high-resolution street view images of Chinese text, with 1,018,402 character instances in total. All images are annotated at the character level with the underlying character type, the bounding box, and six other attributes. These attributes indicate whether the background is complex, whether the text is raised, whether it is handwritten or printed, whether it is occluded, whether it is distorted, and whether it uses word art.
- Link: CTW-download
- SCUT-CTW1500[43]:
- Introduction: It contains 1,500 images in total, 1,000 for training and 500 for testing, with 10,751 cropped word images for testing. Annotations in SCUT-CTW1500 are polygons with 14 vertices. The dataset mainly consists of Chinese and English text.
- Link: SCUT-CTW1500-download
- LSVT(LSVT competition, ICDAR2019)[57]:
- Introduction: It contains 20,000 testing images, 30,000 fully annotated training images, and 400,000 weakly annotated training images, referred to as partial labels. For most weakly labeled training images, only one transcription per image is provided. All images were captured in the streets and cover a large variety of complicated real-world scenarios, e.g., storefronts and landmarks.
- Link: LSVT-download
- ArT(ArT competition, ICDAR2019)[58]:
- Introduction: It contains 10,166 images in total, 5,603 for training and 4,563 for testing. ArT combines Total-Text, SCUT-CTW1500, and Baidu Curved Scene Text, and was collected to introduce the arbitrary-shaped text problem to the scene text community. The dataset emphasizes text-shape diversity, so all existing text shapes (i.e., horizontal, multi-oriented, and curved) are well represented.
- Link: ArT-download
- ReCTS(ReCTS competition, ICDAR2019)[59]:
- Introduction: ReCTS is a practical and challenging multi-orientation natural scene text dataset of 25,000 images, consisting largely of signboards. All text lines and characters are labeled with locations and character codes.
- Link: ReCTS-download
- Chinese Street View Text(C-SVT) [63]:
- Introduction: It contains more than 430,000 street view images in total: 30,000 fully annotated images with locations and text labels for the text regions, and 400,000 more images in which only the annotations of the text of interest are given. It is the largest existing Chinese text reading dataset.
- Synth90k [53] :
- Introduction: It contains 8 million cropped word images generated from a set of 90k common English words. Words are rendered onto natural images with random transformations and effects. Every image is annotated with a ground-truth word.
- Link: Synth90k-download
- SynthText [54] :
- Introduction: It contains 6 million cropped word images. The generation process is similar to that of Synth90k; a simplified rendering sketch follows this list.
- Link: SynthText-download
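To make the synthetic-data generation above concrete, here is a minimal sketch of Synth90k/SynthText-style word-image rendering, assuming only Pillow and NumPy. The function name `render_word` and all parameter values are illustrative assumptions; the real pipelines [53][54] additionally composite words onto natural-image backgrounds with perspective distortion, shading, and blending.

```python
# A toy Synth90k-style renderer: draw a word, apply a small random
# rotation, downsample, and add Gaussian noise. Illustrative only.
import random

import numpy as np
from PIL import Image, ImageDraw, ImageFont


def render_word(word: str, size=(100, 32)) -> Image.Image:
    """Render `word` on a random gray background with a random skew and noise."""
    bg = random.randint(100, 255)
    fg = random.randint(0, bg - 60)  # keep some text/background contrast
    img = Image.new("L", (size[0] * 2, size[1] * 2), color=bg)
    ImageDraw.Draw(img).text((10, size[1] // 2), word,
                             fill=fg, font=ImageFont.load_default())
    img = img.rotate(random.uniform(-5, 5), fillcolor=bg)  # small random rotation
    img = img.resize(size, Image.BILINEAR)                 # render large, then shrink
    arr = np.array(img, dtype=np.float32)
    arr += np.random.normal(0, 8, arr.shape)               # additive Gaussian noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))


if __name__ == "__main__":
    # As in Synth90k, each sample is annotated with its ground-truth word.
    for word in ["hello", "street", "text"]:
        render_word(word).save(f"synth_{word}.png")
```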
Comparison of Datasets

| Dataset | Language | Pictures | Instances | Training Pictures | Training Instances | Testing Pictures | Testing Instances | Lexicon: 50 | Lexicon: 1k | Lexicon: Full | Lexicon: None | Label: Char | Label: Word | Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
IIIT5K[31] | English | 1120 | 5000 | 380 | 2000 | 740 | 3000 | √ | √ | × | √ | √ | √ | Regular |
SVT[32] | English | 350 | 725 | 100 | 211 | 250 | 514 | √ | × | × | √ | × | √ | Regular |
IC03[33] | English | 509 | 2268 | 258 | 1157 | 251 | 1111 | √ | √ | √ | √ | √ | √ | Regular |
IC13[34] | English | 561 | 5003 | 420 | 3564 | 141 | 1439 | × | × | × | √ | √ | √ | Regular |
COCO-Text[38] | English | 63686 | 145859 | 43686 | 118309 | 10000 | 27550 | × | × | × | √ | × | √ | Regular |
SVHN[45] | Digits | 600000 | 600000 | 573968 | 573968 | 26032 | 26032 | × | × | × | √ | √ | √ | Regular |
SVT-P[35] | English | 238 | 639 | - | - | 238 | 639 | √ | × | √ | √ | × | √ | Irregular |
CUTE80[36] | English | 80 | 288 | - | - | 80 | 288 | × | × | × | √ | × | √ | Irregular |
IC15[37] | English | 1500 | - | 1000 | - | 500 | 2077 | × | × | × | √ | × | √ | Irregular |
Total-Text[39] | English | 1555 | 11459 | 1255 | - | 300 | - | × | × | × | √ | × | √ | Irregular |
RCTW-17[40] | Chinese/English | 12514 | - | 11514 | - | 1000 | - | × | × | × | √ | × | √ | Regular |
MTWI[41] | Chinese/English | 20000 | - | 10000 | - | 10000 | - | × | × | × | √ | × | √ | Regular |
CTW[42] | Chinese/English | 32285 | 1018402 | 25887 | 812872 | 3269 | 103519 | × | × | × | √ | √ | √ | Regular |
SCUT-CTW1500[43] | Chinese/English | 1500 | 10751 | 1000 | - | 500 | - | × | × | × | √ | × | √ | Irregular |
LSVT[57] | Chinese/English | 450000 | - | 30000 | - | 20000 | - | × | × | × | √ | × | √ | Irregular |
ArT[58] | Chinese/English | 10166 | - | 5603 | - | 4563 | - | × | × | × | √ | × | √ | Irregular |
ReCTS[59] | Chinese/English | 25000 | - | - | - | - | - | × | × | × | √ | √ | √ | Irregular |
Synth90k[53] | English | 8000000 | - | - | - | - | - | × | × | × | √ | × | √ | Regular |
SynthText[54] | English | 6000000 | - | - | - | - | - | × | × | × | √ | × | √ | Regular |
C-SVT(full annotations)[63] | Chinese | 29966 | 908305 | 20157 | 620368 | 4841 | 143849 | × | × | × | √ | × | √ | Irregular |
It is notable that, in the methods table below: 1) "Reg" stands for regular scene text datasets. 2) "Irreg" stands for irregular scene text datasets. 3) "Seg" denotes methods based on segmentation. 4) "Extra" means the method uses extra datasets in addition to Synth90k and SynthText. 5) "CTC" means the method decodes with a CTC-based algorithm (see the sketch below). 6) "Attn" means the method decodes with an attention mechanism.
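To make the "CTC" marker concrete, below is a minimal sketch of CTC greedy (best-path) decoding, the simplest decoder used by the CTC-based methods in the table: take the best class per frame, collapse consecutive repeats, and drop the blank. The `CHARSET`, blank index 0, and the toy logits are illustrative assumptions, not any particular paper's configuration.

```python
# Greedy (best-path) CTC decoding: argmax per frame, collapse repeats,
# remove blanks. Illustrative sketch; beam search is typically used when
# a language model or lexicon is involved.
import numpy as np

CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"  # index 0 is reserved for blank


def ctc_greedy_decode(logits: np.ndarray) -> str:
    """logits: (T, 1 + len(CHARSET)) per-frame class scores."""
    best = logits.argmax(axis=1)      # best class per frame
    out, prev = [], -1
    for idx in best:
        if idx != prev and idx != 0:  # collapse repeats, skip the blank
            out.append(CHARSET[idx - 1])
        prev = idx
    return "".join(out)


if __name__ == "__main__":
    # Four frames predicting 'h', 'h', blank, 'i' decode to "hi".
    logits = np.zeros((4, 1 + len(CHARSET)))
    h, i = 1 + CHARSET.index("h"), 1 + CHARSET.index("i")
    logits[0, h] = logits[1, h] = logits[2, 0] = logits[3, i] = 1.0
    print(ctc_greedy_decode(logits))  # -> "hi"
```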
You can also download the updated Excel sheet we prepared. (Password: teqv)
Comparison of Methods

| Method | Code | Regular | Irregular | Segmentation | Extra data | CTC | Attention | Source | Time | Highlight |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Wang et al. [1] : ABBYY | √ | √ | × | √ | × | × | × | ICCV | 2011 | A state-of-the-art text detector + a leading commercial OCR engine |
Wang et al. [1] : SYNTH+PLEX | √ | √ | × | × | × | × | × | ICCV | 2011 | The baseline of scene text recognition. |
Mishra et al. [2] | × | √ | × | √ | × | × | × | BMVC | 2012 | 1) Incorporating higher order statistical language models to recognize words in an unconstrained manner. 2) Introducing IIIT5K-word dataset. |
Wang et al. [3] | √ | √ | × | √ | × | × | × | ICPR | 2012 | CNNs + Non-maximal suppression + beam search |
Goel et al. [4] : wDTW | × | √ | × | √ | × | × | × | ICDAR | 2013 | Recognizing the text in the image by matching the scene and synthetic image features with wDTW. |
Bissacco et al. [5] : PhotoOCR | × | √ | × | √ | × | × | × | ICCV | 2013 | Applying a network with five hidden layers for character classification. |
Phan et al. [6] | × | × | √ | √ | × | × | × | ICCV | 2013 | 1) MSER + SIFT descriptors + SVM 2) Introducing the SVT-P datasets. |
Alsharif et al. [7] : HMM/Maxout | × | √ | × | √ | × | × | × | ICLR | 2014 | Convolutional Maxout networks + Hybrid HMM |
Almazan et al [8] : KCSR | √ | √ | × | × | × | × | × | TPAMI | 2014 | Embedding word images and text string in a common vectorial subspace and allowing one to cast recognition and retrieval tasks as a nearest neighbor problem. |
Yao et al. [9] : Strokelets | × | √ | × | √ | × | × | × | CVPR | 2014 | Proposing a novel multi-scale representation for scene text recognition: strokelets. |
R.-Serrano et al.[10] : Label embedding | × | √ | × | × | × | × | × | IJCV | 2015 | Embedding word labels and word images into a common Euclidean space and finding the closest word label in this space. |
Jaderberg et al. [11] | √ | √ | × | √ | × | × | × | ECCV | 2014 | 1) Enabling efficient feature sharing for text detection and classification. 2) Making technical changes over the traditional CNN architectures. 3) Proposing a method of automated data mining of Flickr. |
Su and Lu [12] | × | √ | × | × | × | √ | × | ACCV | 2014 | HOG + BLSTM + CTC |
Gordo[13] : Mid-features | × | √ | × | √ | × | × | × | CVPR | 2015 | Proposing to learn local mid-level features suitable for building word image representations. |
Jaderberg et al. [14] | √ | √ | × | × | × | × | × | IJCV | 2015 | 1) Treating each word as a category and training very large convolutional neural networks to perform word recognition on the whole proposal region. 2) Generating 9 million images, with equal numbers of word samples from a 90k word dictionary. |
Jaderberg et al. [15] | × | √ | × | × | × | × | × | ICLR | 2015 | CNN + CRF |
Shi, Bai, and Yao [16] : CRNN | √ | √ | × | × | × | √ | × | TPAMI | 2017 | CNN + BLSTM + CTC |
Shi et al. [17] : RARE | × | × | √ | × | × | × | √ | CVPR | 2016 | STN + CNN + Attentional BLSTM |
Lee and Osindero [18] : R2AM | × | √ | × | × | × | × | √ | CVPR | 2016 | Presenting recursive recurrent neural networks with attention modeling. |
Liu et al. [19] : STAR-Net | × | × | √ | × | × | √ | × | BMVC | 2016 | STN + ResNet + BLSTM + CTC |
Liu et al. [78] | × | √ | × | √ | √ | × | × | ICPR | 2016 | Integrating the CNN and WFST classification models |
Mishra et al. [77] | × | √ | × | √ | √ | × | × | CVIU | 2016 | Character detection (HOG/CNN + SVM + sliding window) + CRF, combining bottom-up cues from character detections and top-down cues from the lexicon. |
Su and Lu [76] | × | √ | × | × | √ | √ | × | PR | 2017 | HOG(different scale) + BLSTM + CTC (ensemble) |
*Yang et al. [20] | × | × | √ | × | √ | × | √ | IJCAI | 2017 | 1) CNN + 2D-Attention-based RNN, applying an auxiliary dense character detection task that helps to learn text specific visual patterns. 2) Developing a large-scale synthetic dataset. |
Yin et al. [21] | × | √ | × | × | × | √ | × | ICCV | 2017 | CNN + CTC |
Wang et al.[66] : GRCNN | √ | √ | × | × | × | √ | × | NIPS | 2017 | Gated Recurrent Convolution Layer + BLSTM + CTC |
*Cheng et al. [22] : FAN | × | √ | × | × | √ | × | √ | ICCV | 2017 | 1) Proposing the concept of attention drift. 2) Introducing a focusing network to focus deviated attention back onto the target areas. |
Cheng et al. [23] : AON | × | × | √ | × | × | × | √ | CVPR | 2018 | 1) Extracting scene text features in four directions. 2) CNN + Attentional BLSTM |
Gao et al. [24] | × | √ | × | × | × | √ | √ | arXiv | 2017 | Attentional ResNet + CNN + CTC |
Liu et al. [25] : Char-Net | × | × | √ | √ | × | × | √ | AAAI | 2018 | CNN + STN (facilitating the rectification of individual characters) + LSTM |
*Liu et al. [26] : SqueezedText | × | √ | × | × | √ | × | × | AAAI | 2018 | Binary convolutional encoder-decoder network + Bi-RNN |
Zhan et al.[73] | √ | √ | × | × | √ | √ | × | CVPR | 2018 | CRNN, achieving verisimilar scene text image synthesis by combining three novel designs including semantic coherence, visual attention and adaptive text appearance. |
*Bai et al. [27] : EP | × | √ | × | × | √ | × | √ | CVPR | 2018 | Proposing edit probability to effectively handle the misalignment between the training text and the output probability distribution sequence. |
Fang et al.[74] | √ | √ | × | × | × | × | √ | MultiMedia | 2018 | ResNet + [2D Attentional CNN, CNN-based language module] |
Liu et al.[75] : EnEsCTC | √ | √ | × | × | × | √ | × | NIPS | 2018 | Proposing a novel maximum-entropy-based regularization for CTC (EnCTC) and an entropy-based pruning method (EsCTC) to effectively reduce the space of the feasible set. |
Liu et al. [28] | × | √ | × | × | × | √ | × | ECCV | 2018 | Designing a multi-task network with an encoder-discriminator-generator architecture to guide the feature of the original image toward that of the clean image. |
Wang et al.[61] : MAAN | × | √ | × | × | × | × | √ | ICFHR | 2018 | ResNet + BLSTM + Memory-Augmented Attentional Decoder |
Gao et al. [29] | × | √ | × | × | × | √ | √ | ICIP | 2018 | Attentional DenseNet + BLSTM + CTC |
Shi et al. [30] : ASTER | √ | × | √ | × | × | × | √ | TPAMI | 2018 | TPS + ResNet + Bidirectional attention-based BLSTM |
Chen et al. [60] : ASTER + AEG | × | × | √ | × | × | × | √ | NC | 2019 | TPS + ResNet + Bidirectional attention-based BLSTM + AEG |
Luo et al. [46] : MORAN | √ | × | √ | × | × | × | √ | PR | 2019 | Multi-object rectification network + CNN + Attentional BLSTM |
Luo et al. [61] : MORAN-v2 | √ | × | √ | × | × | × | √ | PR | 2019 | Multi-object rectification network + ResNet + Attentional BLSTM |
Chen et al. [60] : MORAN-v2 + AEG | × | × | √ | × | × | × | √ | NC | 2019 | Multi-object rectification network + ResNet + Attentional BLSTM + AEG |
Xie et al. [47] : CAN | × | √ | × | × | × | × | √ | ACM | 2019 | ResNet + CNN + GLU |
*Liao et al.[48] : CA-FCN | × | × | √ | √ | √ | × | √ | AAAI | 2019 | Performing character classification at each pixel location and needing character-level annotations. |
*Li et al. [49] : SAR | √ | × | √ | × | √ | × | √ | AAAI | 2019 | ResNet + 2D Attentional LSTM |
Zhan et al. [55]: ESIR | × | × | √ | × | × | × | √ | CVPR | 2019 | Iterative rectification network + ResNet + Attentional BLSTM |
Zhang et al. [56]: SSDAN | × | √ | × | √ | × | × | √ | CVPR | 2019 | Attentional CNN + GAS + GRU |
Yang et al. [62]: ScRN | × | × | √ | × | √ | × | √ | ICCV | 2019 | Symmetry-constrained Rectification Network + ResNet + BLSTM + Attentional GRU |
Wang et al. [64]: GCAM | × | √ | × | × | × | × | √ | ICME | 2019 | Convolutional Block Attention Module (CBAM) + ResNet + BLSTM + the proposed Gated Cascade Attention Module (GCAM) |
Jeonghun et al. [65] | √ | × | √ | × | × | × | √ | ICCV | 2019 | TPS + ResNet + BLSTM + Attentional Mechanism |
Huang et al. [67] : EPAN | × | × | √ | × | × | × | √ | NC | 2019 | Learning to sample features from the text region of 2D feature maps, and innovatively introducing a two-stage attention mechanism |
Gao et al. [68] | × | √ | × | × | × | √ | × | NC | 2019 | Attentional DenseNet + 4-layer CNN + CTC |
Qi et al. [69] : CCL | × | √ | × | × | √ | √ | × | ICDAR | 2019 | ResNet + [CTC, CCL] |
Wang et al. [70] : ReELFA | × | × | √ | × | √ | × | √ | ICDAR | 2019 | VGG + Attentional LSTM, utilizing one-hot encoded coordinates to indicate the spatial relationship of pixels and character center masks to help focus attention on the right feature areas. |
Zhu et al. [71] : HATN | × | × | √ | × | √ | × | √ | ICIP | 2019 | ResNet50 + Hierarchical Attention Mechanism (Transformer structure) |
Zhan et al. [72] : SF-GAN | × | √ | × | × | √ | × | √ | CVPR | 2019 | ResNet50 + Attentional Decoder, synthesising realistic scene text image for training better recognition models. |
Liao et al. [79] : SAM | √ | × | √ | × | × | × | √ | TPAMI | 2019 | Spatial attentional module (SAM) |
Liao et al. [79] : seg-SAM | √ | × | √ | × | √ | × | √ | TPAMI | 2019 | Character segmentation module + Spatial attention module (SAM) |
Wang et al. [80] : DAN | √ | × | √ | × | × | × | √ | AAAI | 2020 | Decoupling the decoder of the traditional attention mechanism into a convolutional alignment module and a decoupled text decoder |
In this section, we list results on different scene text recognition benchmarks, including IIIT5K, SVT, IC03, IC13, SVT-P, CUTE80, IC15, RCTW-17, MTWI, CTW, SCUT-CTW1500, LSVT, ArT, and ReCTS.
It is notable that 1) '*' indicates methods that use extra datasets in addition to Synth90k and SynthText. 2) Bold indicates the best recognition results. 3) '^' denotes the best recognition results among methods using extra datasets. 4) '@' marks methods evaluated under a different protocol that uses only 1,811 test images. 5) 'SK', 'ST', 'ExPu', 'ExPr', and 'Un' indicate that the method uses Synth90k, SynthText, extra public data, extra private data, or unknown data, respectively. 6) 'D_A' means data augmentation. A sketch of the word-accuracy protocol behind these tables follows.
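As a concrete reference for how these tables are scored, here is a minimal sketch of the common word-accuracy protocol: predictions and ground truths are lowercased and stripped to alphanumerics before comparison, and in the lexicon-constrained settings ("50", "1k", "Full") each prediction is first replaced by the closest lexicon word under edit distance. The function names and toy data are illustrative assumptions, not an official evaluation script.

```python
# Word accuracy with optional lexicon-constrained matching. Illustrative sketch.
import re


def normalize(s: str) -> str:
    """Case-insensitive comparison over alphanumerics only."""
    return re.sub(r"[^0-9a-z]", "", s.lower())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling DP row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]


def word_accuracy(preds, gts, lexicons=None) -> float:
    """Unconstrained ("None") if lexicons is None, else lexicon-constrained."""
    correct = 0
    for k, (pred, gt) in enumerate(zip(preds, gts)):
        if lexicons is not None:  # snap the prediction to its closest lexicon word
            pred = min(lexicons[k],
                       key=lambda w: edit_distance(normalize(pred), normalize(w)))
        correct += normalize(pred) == normalize(gt)
    return correct / len(gts)


if __name__ == "__main__":
    preds, gts = ["hel1o", "W0rld"], ["hello", "world"]
    print(word_accuracy(preds, gts))                                          # 0.0
    print(word_accuracy(preds, gts, [["hello", "help"], ["world", "word"]]))  # 1.0
```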
Recognition Results on Regular Datasets

| Method | IIIT5K (50) | IIIT5K (1K) | IIIT5K (None) | SVT (50) | SVT (None) | IC03 (50) | IC03 (Full) | IC03 (50k) | IC03 (None) | IC13 (None) | Data | Source | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Wang et al. [1] : ABBYY | 24.3 | - | - | 35.0 | - | 56.0 | 55.0 | - | - | - | Un | ICCV | 2011 |
Wang et al. [1] : SYNTH+PLEX | - | - | - | 57.0 | - | 76.0 | 62.0 | - | - | - | ExPr | ICCV | 2011 |
Mishra et al. [2] | 64.1 | 57.5 | - | 73.2 | - | 81.8 | 67.8 | - | - | - | ExPu | BMVC | 2012 |
Wang et al. [3] | - | - | - | 70.0 | - | 90.0 | 84.0 | - | - | - | ExPr | ICPR | 2012 |
Goel et al. [4] : wDTW | - | - | - | 77.3 | - | 89.7 | - | - | - | - | Un | ICDAR | 2013 |
Bissacco et al. [5] : PhotoOCR | - | - | - | 90.4 | 78.0 | - | - | - | - | 87.6 | ExPr | ICCV | 2013 |
Phan et al. [6] | - | - | - | 73.7 | - | 82.2 | - | - | - | - | ExPu | ICCV | 2013 |
Alsharif et al. [7] : HMM/Maxout | - | - | - | 74.3 | - | 93.1 | 88.6 | 85.1 | - | - | ExPu | ICLR | 2014 |
Almazan et al [8] : KCSR | 88.6 | 75.6 | - | 87.0 | - | - | - | - | - | - | ExPu | TPAMI | 2014 |
Yao et al. [9] : Strokelets | 80.2 | 69.3 | - | 75.9 | - | 88.5 | 80.3 | - | - | - | ExPu | CVPR | 2014 |
R.-Serrano et al.[10] : Label embedding | 76.1 | 57.4 | - | 70.0 | - | - | - | - | - | - | ExPu | IJCV | 2015 |
Jaderberg et al. [11] | - | - | - | 86.1 | - | 96.2 | 91.5 | - | - | - | ExPu | ECCV | 2014 |
Su and Lu [12] | - | - | - | 83.0 | - | 92.0 | 82.0 | - | - | - | ExPu | ACCV | 2014 |
Gordo[13] : Mid-features | 93.3 | 86.6 | - | 91.8 | - | - | - | - | - | - | ExPu | CVPR | 2015 |
Jaderberg et al. [14] | 97.1 | 92.7 | - | 95.4 | 80.7 | 98.7 | 98.6 | 93.3 | 93.1 | 90.8 | ExPr | IJCV | 2015 |
Jaderberg et al. [15] | 95.5 | 89.6 | - | 93.2 | 71.7 | 97.8 | 97.0 | 93.4 | 89.6 | 81.8 | SK + ExPr | ICLR | 2015 |
Shi, Bai, and Yao [16] : CRNN | 97.8 | 95.0 | 81.2 | 97.5 | 82.7 | 98.7 | 98.0 | 95.7 | 91.9 | 89.6 | SK | TPAMI | 2017 |
Shi et al. [17] : RARE | 96.2 | 93.8 | 81.9 | 95.5 | 81.9 | 98.3 | 96.2 | 94.8 | 90.1 | 88.6 | SK | CVPR | 2016 |
Lee and Osindero [18] : R2AM | 96.8 | 94.4 | 78.4 | 96.3 | 80.7 | 97.9 | 97.0 | - | 88.7 | 90.0 | SK | CVPR | 2016 |
Liu et al. [19] : STAR-Net | 97.7 | 94.5 | 83.3 | 95.5 | 83.6 | 96.9 | 95.3 | - | 89.9 | 89.1 | SK + ExPr | BMVC | 2016 |
*Liu et al. [78] | 94.1 | 84.7 | - | 92.5 | - | 96.8 | 92.2 | - | - | - | ExPu (D_A) | ICPR | 2016 |
*Mishra et al. [77] | 78.1 | - | 46.7 | 78.2 | - | 88.0 | - | - | 67.7 | 60.2 | ExPu (D_A) | CVIU | 2016 |
*Su and Lu [76] | - | - | - | 91.0 | - | 95.0 | 89.0 | - | - | 76.0 | SK + ExPu | PR | 2017 |
*Yang et al. [20] | 97.8 | 96.1 | - | 95.2 | - | 97.7 | - | - | - | - | ExPu | IJCAI | 2017 |
Yin et al. [21] | 98.7 | 96.1 | 78.2 | 95.1 | 72.5 | 97.6 | 96.5 | - | 81.1 | 81.4 | SK | ICCV | 2017 |
Wang et al.[66] : GRCNN | 98.0 | 95.6 | 80.8 | 96.3 | 81.5 | 98.8 | 97.8 | - | 91.2 | - | SK | NIPS | 2017 |
*Cheng et al. [22] : FAN | 99.3 | 97.5 | 87.4 | 97.1 | 85.9 | 99.2 | 97.3 | - | 94.2 | 93.3 | SK + ST (Pixel_wise) | ICCV | 2017 |
Cheng et al. [23] : AON | 99.6 | 98.1 | 87.0 | 96.0 | 82.8 | 98.5 | 97.1 | - | 91.5 | - | SK + ST (D_A) | CVPR | 2018 |
Gao et al. [24] | 99.1 | 97.9 | 81.8 | 97.4 | 82.7 | 98.7 | 96.7 | - | 89.2 | 88.0 | SK | arXiv | 2017 |
Liu et al. [25] : Char-Net | - | - | 83.6 | - | 84.4 | - | 93.3 | - | 91.5 | 90.8 | SK (D_A) | AAAI | 2018 |
*Liu et al. [26] : SqueezedText | 97.0 | 94.1 | 87.0 | 95.2 | - | 98.8 | 97.9 | 93.8 | 93.1 | 92.9 | ExPr | AAAI | 2018 |
*Zhan et al.[73] | 98.1 | 95.3 | 79.3 | 96.7 | 81.5 | - | - | - | - | 87.1 | Pr(5 million) | CVPR | 2018 |
*Bai et al. [27] : EP | 99.5 | 97.9 | 88.3 | 96.6 | 87.5 | 98.7 | 97.9 | - | 94.6 | 94.4 | SK + ST (Pixel_wise) | CVPR | 2018 |
Fang et al.[74] | 98.5 | 96.8 | 86.7 | 97.8 | 86.7 | 99.3 | 98.4 | - | 94.8 | 93.5 | SK + ST | MultiMedia | 2018 |
Liu et al.[75] : EnEsCTC | - | - | 82.0 | - | 80.6 | - | - | - | 92.0 | 90.6 | SK | NIPS | 2018 |
Liu et al. [28] | 97.3 | 96.1 | 89.4 | 96.8 | 87.1 | 98.1 | 97.5 | - | 94.7 | 94.0 | SK | ECCV | 2018 |
Wang et al.[61] : MAAN | 98.3 | 96.4 | 84.1 | 96.4 | 83.5 | 97.4 | 96.4 | - | 92.2 | 91.1 | SK | ICFHR | 2018 |
Gao et al. [29] | 99.1 | 97.2 | 83.6 | 97.7 | 83.9 | 98.6 | 96.6 | - | 91.4 | 89.5 | SK | ICIP | 2018 |
Shi et al. [30] : ASTER | 99.6 | 98.8 | 93.4 | 97.4 | 89.5 | 98.8 | 98.0 | - | 94.5 | 91.8 | SK + ST | TPAMI | 2018 |
Chen et al. [60] : ASTER + AEG | 99.5 | 98.5 | 94.4 | 97.4 | 90.3 | 99.0 | 98.3 | - | 95.2 | 95.0 | SK + ST | NC | 2019 |
Luo et al. [46] : MORAN | 97.9 | 96.2 | 91.2 | 96.6 | 88.3 | 98.7 | 97.8 | - | 95.0 | 92.4 | SK + ST | PR | 2019 |
Luo et al. [61] : MORAN-v2 | - | - | 93.4 | - | 88.3 | - | - | - | 94.2 | 93.2 | SK + ST | PR | 2019 |
Chen et al. [60] : MORAN-v2 + AEG | 99.5 | 98.7 | 94.6 | 97.4 | 90.4 | 98.8 | 98.3 | - | 95.3 | 95.3 | SK + ST | NC | 2019 |
Xie et al. [47] : CAN | 97.0 | 94.2 | 80.5 | 96.9 | 83.4 | 98.4 | 97.8 | - | 91.0 | 90.5 | SK | ACM | 2019 |
*Liao et al.[48] : CA-FCN | ^99.8 | 98.9 | 92.0 | 98.8 | 82.1 | - | - | - | - | 91.4 | SK + ST+ ExPr | AAAI | 2019 |
*Li et al. [49] : SAR | 99.4 | 98.2 | 95.0 | 98.5 | 91.2 | - | - | - | - | 94.0 | SK + ST + ExPr | AAAI | 2019 |
Zhan et al. [55]: ESIR | 99.6 | 98.8 | 93.3 | 97.4 | 90.2 | - | - | - | - | 91.3 | SK + ST | CVPR | 2019 |
Zhang et al. [56]: SSDAN | - | - | 83.8 | - | 84.5 | - | - | - | 92.1 | 91.8 | SK | CVPR | 2019 |
*Yang et al. [62]: ScRN | 99.5 | 98.8 | 94.4 | 97.2 | 88.9 | 99.0 | 98.3 | - | 95.0 | 93.9 | SK + ST(char-level + word-level) | ICCV | 2019 |
Wang et al. [64]: GCAM | - | - | 93.9 | - | 91.3 | - | - | - | 95.3 | 95.7 | SK + ST | ICME | 2019 |
Jeonghun et al. [65] | - | - | 87.9 | - | 87.5 | - | - | - | 94.4 | 92.3 | SK + ST | ICCV | 2019 |
Huang et al. [67] : EPAN | 98.9 | 97.8 | 94.0 | 96.6 | 88.9 | 98.7 | 98.0 | - | 95.0 | 94.5 | SK + ST | NC | 2019 |
Gao et al. [68] | 99.1 | 97.9 | 81.8 | 97.4 | 82.7 | 98.7 | 96.7 | - | 89.2 | 88.0 | SK | NC | 2019 |
*Qi et al. [69] : CCL | 99.6 | 99.1 | 91.1 | 98.0 | 85.9 | 99.2 | ^98.8 | - | 93.5 | 92.8 | SK + ST(char-level + word-level) | ICDAR | 2019 |
*Wang et al. [70] : ReELFA | 99.2 | 98.1 | 90.9 | - | 82.7 | - | - | - | - | - | ST(char-level + word-level) | ICDAR | 2019 |
*Zhu et al. [71] : HATN | - | - | 88.6 | - | 82.2 | - | - | - | 91.3 | 91.1 | SK(D_A) + Pu | ICIP | 2019 |
*Zhan et al. [72] : SF-GAN | - | - | 63.0 | - | 69.3 | - | - | - | - | 61.8 | Pr(1 million) | CVPR | 2019 |
Liao et al. [79] : SAM | 99.4 | 98.6 | 93.9 | 98.6 | 90.6 | 98.8 | 98.0 | - | 95.2 | 95.3 | SK + ST | TPAMI | 2019 |
*Liao et al. [79] : seg-SAM | ^99.8 | ^99.3 | ^95.3 | ^99.1 | ^91.8 | 99.0 | 97.9 | - | 95.0 | 95.3 | SK + ST (char-level) | TPAMI | 2019 |
Wang et al. [80] : DAN | - | - | 94.3 | - | 89.2 | - | - | - | 95.0 | 93.9 | SK + ST | AAAI | 2020 |
Recognition Results on Irregular Datasets

| Method | SVT-P (50) | SVT-P (Full) | SVT-P (None) | CUTE80 (None) | IC15 (None) | COCO-Text (None) | Data | Source | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Wang et al. [1] : ABBYY | 40.5 | 26.1 | - | - | - | - | Un | ICCV | 2011 |
Wang et al. [1] : SYNTH+PLEX | - | - | - | - | - | - | ExPr | ICCV | 2011 |
Mishra et al. [2] | 45.7 | 24.7 | - | - | - | - | ExPu | BMVC | 2012 |
Wang et al. [3] | 40.2 | 32.4 | - | - | - | - | ExPr | ICPR | 2012 |
Goel et al. [4] : wDTW | - | - | - | - | - | - | Un | ICDAR | 2013 |
Bissacco et al. [5] : PhotoOCR | - | - | - | - | - | - | ExPr | ICCV | 2013 |
Phan et al. [6] | 62.3 | 42.2 | - | - | - | - | ExPu | ICCV | 2013 |
Alsharif et al. [7] : HMM/Maxout | - | - | - | - | - | - | ExPu | ICLR | 2014 |
Almazan et al [8] : KCSR | - | - | - | - | - | - | ExPu | TPAMI | 2014 |
Yao et al. [9] : Strokelets | - | - | - | - | - | - | ExPu | CVPR | 2014 |
R.-Serrano et al.[10] : Label embedding | - | - | - | - | - | - | ExPu | IJCV | 2015 |
Jaderberg et al. [11] | - | - | - | - | - | - | ExPu | ECCV | 2014 |
Su and Lu [12] | - | - | - | - | - | - | ExPu | ACCV | 2014 |
Gordo[13] : Mid-features | - | - | - | - | - | - | ExPu | CVPR | 2015 |
Jaderberg et al. [14] | - | - | - | - | - | - | ExPr | IJCV | 2015 |
Jaderberg et al. [15] | - | - | - | - | - | - | SK + ExPr | ICLR | 2015 |
Shi, Bai, and Yao [16] : CRNN | - | - | - | - | - | - | SK | TPAMI | 2017 |
Shi et al. [17] : RARE | 91.2 | 77.4 | 71.8 | 59.2 | - | - | SK | CVPR | 2016 |
Lee and Osindero [18] : R2AM | - | - | - | - | - | - | SK | CVPR | 2016 |
Liu et al. [19] : STAR-Net | 94.3 | 83.6 | 73.5 | - | - | - | SK + ExPr | BMVC | 2016 |
*Liu et al. [78] | - | - | - | - | - | - | ExPu (D_A) | ICPR | 2016 |
*Mishra et al. [77] | - | - | - | - | - | - | ExPu (D_A) | CVIU | 2016 |
*Su and Lu [76] | - | - | - | - | - | - | SK + ExPu | PR | 2017 |
*Yang et al. [20] | 93.0 | 80.2 | 75.8 | 69.3 | - | - | ExPu | IJCAI | 2017 |
Yin et al. [21] | - | - | - | - | - | - | SK | ICCV | 2017 |
Wang et al.[66] : GRCNN | - | - | - | - | - | - | SK | NIPS | 2017 |
*Cheng et al. [22] : FAN | - | - | - | - | *85.3 | - | SK + ST (Pixel_wise) | ICCV | 2017 |
Cheng et al. [23] : AON | 94.0 | 83.7 | 73.0 | 76.8 | 68.2 | - | SK + ST (D_A) | CVPR | 2018 |
Gao et al. [24] | - | - | - | - | - | - | SK | arXiv | 2017 |
Liu et al. [25] : Char-Net | - | - | 73.5 | - | 60.0 | - | SK (D_A) | AAAI | 2018 |
*Liu et al. [26] : SqueezedText | - | - | - | - | - | - | ExPr | AAAI | 2018 |
*Zhan et al.[73] | - | - | - | - | - | - | Pr(5 million) | CVPR | 2018 |
*Bai et al. [27] : EP | - | - | - | - | 73.9 | - | SK + ST (Pixel_wise) | CVPR | 2018 |
Fang et al.[74] | - | - | - | - | 71.2 | - | SK + ST | MultiMedia | 2018 |
Liu et al.[75] : EnEsCTC | - | - | - | - | - | - | SK | NIPS | 2018 |
Liu et al. [28] | - | - | 73.9 | 62.5 | - | - | SK | ECCV | 2018 |
Wang et al.[61] : MAAN | - | - | - | - | - | - | SK | ICFHR | 2018 |
Gao et al. [29] | - | - | - | - | - | - | SK | ICIP | 2018 |
Shi et al. [30] : ASTER | - |