Scene Text Recognition Resources

Author: 陈晓雪

1. Datasets
2. Summary of Scene Text Recognition Results
   2.1 Comparison of methods
   2.2 Recognition Results
3. Survey
4. OCR Service
5. References
6. Help
7. Copyright

1. Datasets

1.1 Regular Scene Text Datasets

IIIT5K[31]:
- Introduction: It contains 5,000 images in total, 2,000 for training and 3,000 for testing. Every image is associated with a 50-word lexicon and a 1,000-word lexicon. Each lexicon consists of the ground-truth word and some randomly picked words.
- Link: IIIT5K-download
SVT[1]:
- Introduction: It contains 647 cropped word images. Many images are severely corrupted by noise, blur, and low resolution. SVT was collected from Google Street View, and every image is associated with a 50-word lexicon. It provides only word-level annotations.
- Link: SVT-download
ICDAR 2003(IC03)[33]:
- Introduction: It contains 509 images in total, 258 for training and 251 for testing. Specifically, it contains 867 cropped word images after discarding images that contain non-alphanumeric characters or that have fewer than three characters. Every image is associated with a 50-word lexicon and a full-word lexicon. The full lexicon combines all lexicon words.
- Link: IC03-download
ICDAR 2013(IC13)[34]:
- Introduction: It contains 1,015 cropped word images and inherits most of its samples from IC03. No lexicon is associated with this dataset.
- Link: IC13-download
COCO-Text[38]:
- Introduction: It contains 63,686 images in total. Specifically, it contains 145,859 cropped word images for testing, including handwritten and printed, clear and blurred, and English and non-English text.
- Link: COCO-Text-download
SVHN[45]:
- Introduction: It contains more than 600,000 digits of house numbers in natural scenes. The images were collected from Google Street View and are used for digit recognition.
- Link: SVHN-download
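
Several of the datasets above ship a per-image lexicon (50 words, 1,000 words, or a full lexicon). One common way to use such a lexicon, sketched here under our own assumptions rather than taken from any method listed in this document, is to replace the raw prediction with the lexicon word at the smallest edit distance. The function names and the five-word lexicon below are hypothetical and only illustrate the idea.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def constrain_to_lexicon(prediction: str, lexicon: Sequence[str]) -> str:
    """Return the lexicon word closest to the raw prediction (case-insensitive)."""
    return min(lexicon, key=lambda w: edit_distance(prediction.lower(), w.lower()))


# Hypothetical lexicon; the benchmark lexicons hold 50 or 1,000 words per image.
lexicon = ["hotel", "motel", "house", "hostel", "horse"]
print(constrain_to_lexicon("hote1", lexicon))
```

With a lexicon, a recognizer only has to get close to the right word, which is one reason the "50" columns in the result tables are usually higher than the "None" columns.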

1.2 Irregular Scene Text Datasets

SVT-P[35]:
- Introduction: It contains 639 cropped word images for testing. Images were selected from the side-view angle snapshots in Google Street View. Therefore, most images are heavily distorted by the non-frontal view angle. Every image is associated with a 50-word lexicon and a full-word lexicon.
- Link: SVT-P-download (Password : vnis)
CUTE80[36]:
- Introduction: It contains 80 high-resolution images taken in natural scenes. Specifically, it contains 288 cropped word images for testing. The dataset focuses on curved text. No lexicon is provided.
- Link: CUTE80-download
ICDAR 2015(IC15)[37]:
- Introduction: It contains 1,500 images in total, 1,000 for training and 500 for testing. Specifically, it contains 2,077 cropped images, including more than 200 irregular text instances. No lexicon is associated with this dataset.
- Link: IC15-download
Total-Text[39]:
- Introduction: It contains 1,555 images in total. Specifically, it contains 11,459 cropped word images with more than three different text orientations: horizontal, multi-oriented and curved.
- Link: Total-Text-download

1.3 Bilingual Scene Text Datasets (mainly in Chinese and English)

RCTW-17(RCTW competition, ICDAR17)[40]:
- Introduction: It contains 12,514 images in total, 11,514 for training and 1,000 for testing. Most images in RCTW-17 were captured by camera or mobile phone, and the rest were generated. Text instances are annotated with parallelograms. It was the first large-scale Chinese dataset and, at the time of its release, the largest published one.
- Link: RCTW-17-download
MTWI(competition)[41]:
- Introduction: It contains 20,000 images. The dataset mainly consists of Chinese or English web text. The competition includes three tasks: web text recognition, web text detection and end-to-end web text detection and recognition.
- Link: MTWI-download (Password:gox9)
CTW[42]:
- Introduction: It contains 32,285 high-resolution street view images of Chinese text, with 1,018,402 character instances in total. Every character is annotated with its underlying character type, its bounding box, and 6 other attributes. These attributes indicate whether the background is complex and whether the character is raised, hand-written or printed, occluded, distorted, or rendered as word art.
- Link: CTW-download
SCUT-CTW1500[43]:
- Introduction: It contains 1,500 images in total, 1,000 for training and 500 for testing. Specifically, it contains 10,751 cropped word images for testing. Annotations in SCUT-CTW1500 are polygons with 14 vertices. The dataset mainly consists of Chinese and English text.
- Link: SCUT-CTW1500-download

LSVT(LSVT competition, ICDAR2019)[57]:
- Introduction: It contains 20,000 testing images, 30,000 fully annotated training images, and 400,000 weakly annotated training images; the weak annotations are referred to as partial labels. For most of the weakly labeled training data, only one transcription per image is provided. All the images were captured on streets and cover a large variety of complicated real-world scenarios, e.g., store fronts and landmarks.
- Link: LSVT-download
ArT(ArT competition, ICDAR2019)[58]:
- Introduction: It contains 10,166 images in total, 5,603 for training and 4,563 for testing. ArT is a combination of Total-Text, SCUT-CTW1500 and Baidu Curved Scene Text, which were collected to introduce the arbitrary-shaped text problem to the scene text community. The ArT dataset was collected with text shape diversity in mind, so all existing text shapes (i.e. horizontal, multi-oriented, and curved) are well represented in the dataset.
- Link: ArT-download
ReCTS(ReCTS competition, ICDAR2019)[59]:
- Introduction: ReCTS is a practical and challenging multi-orientation natural scene text dataset of 25,000 images, most of which show signboards. All text lines and characters in the dataset are labeled with locations and character codes.
- Link: ReCTS-download

1.4 Synthetic Datasets

Synth90k [53] :
- Introduction: It contains 8 million cropped word images generated from a set of 90k common English words. Words are rendered onto natural images with random transformations and effects. Every image is annotated with a ground-truth word.
- Link: Synth90k-download
SynthText [54] :
- Introduction: It contains 6 million cropped word images. The generation process is similar to that of Synth90k.
- Link: SynthText-download

1.5 Comparison of Datasets

Comparison of Datasets
Datasets	Language	Images						Lexicon / Label	Type
Datasets	Language	Pictures	Instances	Training Pictures	Training Instances	Testing Pictures	Testing Instances	Full / None / Char / Word	Type
IIIT5K[31]	English	1120	5000	380	2000	740	3000	√	Regular
SVT[1]	English	350	725	100	211	250	514	√	Regular
IC03[33]	English	509	2268	258	1157	251	1111	√	Regular
IC13[34]	English	561	5003	420	3564	141	1439	√	Regular
COCO-Text[38]	English	63686	145859	43686	118309	10000	27550	√	Regular
SVHN[45]	Digits	-	600000	-	573968	-	26032	√	Regular
SVT-P[35]	English	238	639	-	-	238	639	√	Irregular
CUTE80[36]	English	-	-	-	-	-	288	√	Irregular
IC15[37]	English	1500	-	1000	-	500	2077	√	Irregular
Total-Text[39]	English	1555	11459	1255	-	300	-	√	Irregular
RCTW-17[40]	Chinese/English	12514	-	11514	-	1000	-	√	Regular
MTWI[41]	Chinese/English	20000	-	10000	-	-	-	√	Regular
CTW[42]	Chinese/English	32285	1018402	25887	812872	3269	103519	√	Regular
SCUT-CTW1500[43]	Chinese/English	1500	10751	1000	-	500	-	√	Irregular
LSVT[57]	Chinese/English	450000	-	30000	-	20000	-	√	Irregular
ArT[58]	Chinese/English	10166	-	5603	-	4563	-	√	Irregular
ReCTS[59]	Chinese/English	25000	-	-	-	-	-	√	Irregular
Synth90k[53]	English	-	8000000	-	-	-	-	√	Regular
SynthText[54]	English	-	6000000	-	-	-	-	√	Regular

2. Summary of Scene Text Recognition Results

2.1 Comparison of methods

Notes: 1) "Reg" stands for regular scene text datasets. 2) "Irreg" stands for irregular scene text datasets. 3) "Seg" denotes segmentation-based methods. 4) "Extra" means the method uses extra datasets. 5) "CTC" means the method decodes with a CTC-based algorithm. 6) "Attn" means the method decodes with an attention mechanism.

You can also download the Excel file we prepared. (Password: 1kwj)

Comparison of methods
Method	Code	Regular	Irregular	Segmentation	Extra data	CTC	Attention	Source	Time	Highlight
Wang et al. [1] : ABBYY	√	√	×	√	×	×	×	ICCV	2011	A state-of-the-art text detector + a leading commercial OCR engine
Wang et al. [1] : SYNTH+PLEX	√	√	×	×	×	×	×	ICCV	2011	The baseline of scene text recognition.
Mishra et al. [2]	×	√	×	√	×	×	×	BMVC	2012	1) Incorporating higher order statistical language models to recognize words in an unconstrained manner. 2) Introducing IIIT5K-word dataset.
Wang et al. [3]	√	√	×	√	×	×	×	ICPR	2012	CNNs + Non-maximal suppression + beam search
Goel et al. [4] : wDTW	×	√	×	√	×	×	×	ICDAR	2013	Recognizing the text in the image by matching the scene and synthetic image features with wDTW.
Bissacco et al. [5] : PhotoOCR	×	√	×	√	×	×	×	ICCV	2013	Applying a network with five hidden layers for character classification.
Phan et al. [6]	×	×	√	√	×	×	×	ICCV	2013	1) MSER + SIFT descriptors + SVM 2) Introducing the SVT-P dataset.
Alsharif et al. [7] : HMM/Maxout	×	√	×	√	×	×	×	ICLR	2014	Convolutional Maxout networks + Hybrid HMM
Almazan et al. [8] : KCSR	√	√	×	×	×	×	×	TPAMI	2014	Embedding word images and text strings in a common vectorial subspace, which allows recognition and retrieval tasks to be cast as a nearest neighbor problem.
Yao et al. [9] : Strokelets	×	√	×	√	×	×	×	CVPR	2014	Proposing a novel multi-scale representation for scene text recognition: strokelets.
R.-Serrano et al.[10] : Label embedding	×	√	×	×	×	×	×	IJCV	2015	Embedding word labels and word images into a common Euclidean space and finding the closest word label in this space.
Jaderberg et al. [11]	√	√	×	√	×	×	×	ECCV	2014	1) Enabling efficient feature sharing for text detection and classification. 2) Making technical changes over the traditional CNN architectures. 3) Proposing a method of automated data mining of Flickr.
Su and Lu [12]	×	√	×	×	×	√	×	ACCV	2014	HOG + BLSTM + CTC
Gordo[13] : Mid-features	×	√	×	√	×	×	×	CVPR	2015	Proposing to learn local mid-level features suitable for building word image representations.
Jaderberg et al. [14]	√	√	×	×	×	×	×	IJCV	2015	1) Treating each word as a category and training very large convolutional neural networks to perform word recognition on the whole proposal region. 2) Generating 9 million images, with equal numbers of word samples from a 90k word dictionary.
Jaderberg et al. [15]	×	√	×	×	×	×	×	ICLR	2015	CNN + CRF
Shi, Bai, and Yao [16] : CRNN	√	√	×	×	×	√	×	TPAMI	2017	CNN + BLSTM + CTC
Shi et al. [17] : RARE	×	×	√	×	×	×	√	CVPR	2016	STN + CNN + Attentional BLSTM
Lee and Osindero [18] : R2AM	×	√	×	×	×	×	√	CVPR	2016	Presenting recursive recurrent neural networks with attention modeling.
Liu et al. [19] : STAR-Net	×	×	√	×	×	√	×	BMVC	2016	STN + ResNet + BLSTM + CTC
*Yang et al. [20]	×	×	√	×	√	×	√	IJCAI	2017	1) CNN + 2D-Attention-based RNN, applying an auxiliary dense character detection task that helps to learn text specific visual patterns. 2) Developing a large-scale synthetic dataset.
Yin et al. [21]	×	√	×	×	×	√	×	ICCV	2017	CNN + CTC
*Cheng et al. [22] : FAN	×	√	×	×	√	×	√	ICCV	2017	1) Proposing the concept of attention drift. 2) Introducing a focusing network to focus deviated attention back on the target areas.
Cheng et al. [23] : AON	×	×	√	×	×	×	√	CVPR	2018	1) Extracting scene text features in four directions. 2) CNN + Attentional BLSTM
Gao et al. [24]	×	√	×	×	×	√	√	arXiv	2017	Attentional ResNet + CNN + CTC
Liu et al. [25] : Char-Net	×	×	√	√	×	×	√	AAAI	2018	CNN + STN (facilitating the rectification of individual characters) + LSTM
*Liu et al. [26] : SqueezedText	×	√	×	×	√	×	×	AAAI	2018	Binary convolutional encoder-decoder network + Bi-RNN
*Bai et al. [27] : EP	×	√	×	×	√	×	√	CVPR	2018	Proposing edit probability to effectively handle the misalignment between the training text and the output probability distribution sequence.
Liu et al. [28]	×	√	×	×	×	√	×	ECCV	2018	Designing a multi-task network with an encoder-discriminator-generator architecture to guide the feature of the original image toward that of the clean image.
Gao et al. [29]	×	√	×	×	×	√	√	ICIP	2018	Attentional DenseNet + BLSTM + CTC
Shi et al. [30] : ASTER	√	×	√	×	×	×	√	TPAMI	2018	TPS + ResNet + Bidirectional attention-based BLSTM
Chen et al. [60] : ASTER + AEG	×	×	√	×	×	×	√	arXiv	2019	TPS + ResNet + Bidirectional attention-based BLSTM + AEG
Luo et al. [46] : MORAN	√	×	√	×	×	×	√	PR	2019	Multi-object rectification network + CNN + Attentional BLSTM
Luo et al. [32] : MORAN-v2	√	×	√	×	×	×	√	PR	2019	Multi-object rectification network + ResNet + Attentional BLSTM
Chen et al. [60] : MORAN-v2 + AEG	×	×	√	×	×	×	√	arXiv	2019	Multi-object rectification network + ResNet + Attentional BLSTM + AEG
Xie et al. [47] : CAN	×	√	×	×	×	×	√	ACM	2019	ResNet + CNN + GLU
*Liao et al.[48] : CA-FCN	×	×	√	√	√	×	√	AAAI	2019	Performing character classification at each pixel location and needing character-level annotations.
*Li et al. [49] : SAR	√	×	√	×	√	×	√	AAAI	2019	ResNet + 2D Attentional LSTM
Zhan et al. [55]: ESIR	×	×	√	×	×	×	√	CVPR	2019	Iterative rectification network + ResNet + Attentional BLSTM
Zhang et al. [56]: SSDAN	×	√	×	√	×	×	√	CVPR	2019	Attentional CNN + GAS + GRU
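
To make the "CTC" column above more concrete, the following is a minimal sketch of greedy (best-path) CTC decoding: take the most likely label per frame, collapse consecutive repeats, then drop blanks. It assumes the per-frame argmax has already been computed, that index 0 is the blank symbol, and that labels 1..26 map to lowercase letters. These conventions are ours for illustration and are not taken from any specific method in the table.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_labels, charset):
    """Collapse consecutive repeated labels, then remove blanks."""
    decoded = []
    previous = BLANK
    for label in frame_labels:
        # Emit a character only when the label changes and is not the blank.
        if label != previous and label != BLANK:
            decoded.append(charset[label - 1])  # label 1 maps to charset[0]
        previous = label
    return "".join(decoded)


charset = "abcdefghijklmnopqrstuvwxyz"
# Per-frame argmax labels: h h _ e _ l l _ l o o  (where _ is the blank)
frames = [8, 8, 0, 5, 0, 12, 12, 0, 12, 15, 15]
print(ctc_greedy_decode(frames, charset))  # hello
```

The blank between the two runs of label 12 is what lets the decoder output a double "l"; without it the repeats would collapse into one character.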

2.2 Recognition Results

In this section, we list the results on different scene text recognition benchmarks, including IIIT5K, SVT, IC03, IC13, SVT-P, CUTE80, IC15, RCTW-17, MTWI, CTW, SCUT-CTW1500, LSVT, ArT and ReCTS.

Notes: 1) '*' indicates that the method uses extra datasets. 2) Bold marks the best recognition result. 3) '^' marks the best recognition result among methods that use extra datasets. 4) '@' marks methods evaluated under a different protocol that uses only 1,811 test images. 5) 'SK', 'ST', 'ExPu', 'ExPr' and 'Un' indicate that the method uses Synth90K, SynthText, Extra Public Data, Extra Private Data and unknown data, respectively.
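
If you process the result tables programmatically, the abbreviations in the "Data" column can be expanded with a small helper such as the sketch below. The mapping comes from the notes above; the function name is hypothetical, and parenthesised qualifiers such as "(Pixel_wise)" or "(D_A)" are not covered by the notes, so this sketch simply discards them.

```python
import re

# Abbreviations as defined in the notes of this section.
DATA_CODES = {
    "SK": "Synth90K",
    "ST": "SynthText",
    "ExPu": "Extra Public Data",
    "ExPr": "Extra Private Data",
    "Un": "unknown data",
}


def expand_data_codes(cell: str) -> list:
    """Turn a 'Data' cell such as 'SK + ST+ ExPr' into full dataset names."""
    cell = re.sub(r"\(.*?\)", "", cell)  # drop qualifiers like "(Pixel_wise)"
    codes = [part.strip() for part in cell.split("+")]
    # An unknown abbreviation raises KeyError, which surfaces typos early.
    return [DATA_CODES[code] for code in codes if code]


print(expand_data_codes("SK + ST+ ExPr"))
print(expand_data_codes("SK + ST (Pixel_wise)"))
```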

2.2.1 Recognition Results on Regular Dataset

Recognition Results on Regular Dataset
Method	IIIT5K			SVT		IC03				IC13	Data	Source	Time
Method	50	1K	None	50	None	50	Full	50k	None	None	Data	Source	Time
Wang et al. [1] : ABBYY	24.3	-	-	35.0	-	56.0	55.0	-	-	-	Un	ICCV	2011
Wang et al. [1] : SYNTH+PLEX	-	-	-	57.0	-	76.0	62.0	-	-	-	ExPr	ICCV	2011
Mishra et al. [2]	64.1	57.5	-	73.2	-	81.8	67.8	-	-	-	ExPu	BMVC	2012
Wang et al. [3]	-	-	-	70.0	-	90.0	84.0	-	-	-	ExPr	ICPR	2012
Goel et al. [4] : wDTW	-	-	-	77.3	-	89.7	-	-	-	-	Un	ICDAR	2013
Bissacco et al. [5] : PhotoOCR	-	-	-	90.4	78.0	-	-	-	-	87.6	ExPr	ICCV	2013
Phan et al. [6]	-	-	-	73.7	-	82.2	-	-	-	-	ExPu	ICCV	2013
Alsharif et al. [7] : HMM/Maxout	-	-	-	74.3	-	93.1	88.6	85.1	-	-	ExPu	ICLR	2014
Almazan et al. [8] : KCSR	88.6	75.6	-	87.0	-	-	-	-	-	-	ExPu	TPAMI	2014
Yao et al. [9] : Strokelets	80.2	69.3	-	75.9	-	88.5	80.3	-	-	-	ExPu	CVPR	2014
R.-Serrano et al.[10] : Label embedding	76.1	57.4	-	70.0	-	-	-	-	-	-	ExPu	IJCV	2015
Jaderberg et al. [11]	-	-	-	86.1	-	96.2	91.5	-	-	-	ExPu	ECCV	2014
Su and Lu [12]	-	-	-	83.0	-	92.0	82.0	-	-	-	ExPu	ACCV	2014
Gordo[13] : Mid-features	93.3	86.6	-	91.8	-	-	-	-	-	-	ExPu	CVPR	2015
Jaderberg et al. [14]	97.1	92.7	-	95.4	80.7	98.7	98.6	93.3	93.1	90.8	ExPr	IJCV	2015
Jaderberg et al. [15]	95.5	89.6	-	93.2	71.7	97.8	97.0	93.4	89.6	81.8	SK + ExPr	ICLR	2015
Shi, Bai, and Yao [16] : CRNN	97.8	95.0	81.2	97.5	82.7	98.7	98.0	95.7	91.9	89.6	SK	TPAMI	2017
Shi et al. [17] : RARE	96.2	93.8	81.9	95.5	81.9	98.3	96.2	94.8	90.1	88.6	SK	CVPR	2016
Lee and Osindero [18] : R2AM	96.8	94.4	78.4	96.3	80.7	97.9	97.0	-	88.7	90.0	SK	CVPR	2016
Liu et al. [19] : STAR-Net	97.7	94.5	83.3	95.5	83.6	96.9	95.3	-	89.9	89.1	SK + ExPr	BMVC	2016
*Yang et al. [20]	97.8	96.1	-	95.2	-	97.7	-	-	-	-	ExPu	IJCAI	2017
Yin et al. [21]	98.7	96.1	78.2	95.1	72.5	97.6	96.5	-	81.1	81.4	SK	ICCV	2017
*Cheng et al. [22] : FAN	99.3	97.5	87.4	97.1	85.9	^99.2	97.3	-	94.2	93.3	SK + ST (Pixel_wise)	ICCV	2017
Cheng et al. [23] : AON	99.6	98.1	87.0	96.0	82.8	98.5	97.1	-	91.5	-	SK + ST (D_A)	CVPR	2018
Gao et al. [24]	99.1	97.9	81.8	97.4	82.7	98.7	96.7	-	89.2	88.0	SK	arXiv	2017
Liu et al. [25] : Char-Net	-	-	83.6	-	84.4	-	93.3	-	91.5	90.8	SK (D_A)	AAAI	2018
*Liu et al. [26] : SqueezedText	97.0	94.1	87.0	95.2	-	98.8	97.9	93.8	93.1	92.9	ExPr	AAAI	2018
*Bai et al. [27] : EP	99.5	97.9	88.3	96.6	87.5	98.7	97.9	-	94.6	94.4	SK + ST (Pixel_wise)	CVPR	2018
Liu et al. [28]	97.3	96.1	89.4	96.8	87.1	98.1	97.5	-	94.7	94.0	SK	ECCV	2018
Gao et al. [29]	99.1	97.2	83.6	97.7	83.9	98.6	96.6	-	91.4	89.5	SK	ICIP	2018
Shi et al. [30] : ASTER	99.6	98.8	93.4	97.4	89.5	98.8	98.0	-	94.5	91.8	SK + ST	TPAMI	2018
Chen et al. [60] : ASTER + AEG	99.5	98.5	94.4	97.4	90.3	99.0	98.3	-	95.2	95.0	SK + ST	arXiv	2019
Luo et al. [46] : MORAN	97.9	96.2	91.2	96.6	88.3	98.7	97.8	-	95.0	92.4	SK + ST	PR	2019
Luo et al. [32] : MORAN-v2	-	-	93.4	-	88.3	-	-	-	94.2	93.2	SK + ST	PR	2019
Chen et al. [60] : MORAN-v2 + AEG	99.5	98.7	94.6	97.4	90.4	98.8	98.3	-	95.3	95.3	SK + ST	arXiv	2019
Xie et al. [47] : CAN	97.0	94.2	80.5	96.9	83.4	98.4	97.8	-	91.0	90.5	SK	ACM	2019
*Liao et al.[48] : CA-FCN	^99.8	^98.9	92.0	^98.8	82.1	-	-	-	-	91.4	SK + ST+ ExPr	AAAI	2019
*Li et al. [49] : SAR	99.4	98.2	^95.0	98.5	^91.2	-	-	-	-	94.0	SK + ST + ExPr	AAAI	2019
Zhan et al. [55]: ESIR	99.6	98.8	93.3	97.4	90.2	-	-	-	-	91.3	SK + ST	CVPR	2019
Zhang et al. [56]: SSDAN	-	-	83.8	-	84.5	-	-	-	92.1	91.8	SK	CVPR	2019

2.2.2 Recognition Results on Irregular Dataset

Recognition Results on Irregular Datasets
Method	SVT-P			CUTE80	IC15	COCO-TEXT	Data	Source	Time
Method	50	Full	None	None	None	None	Data	Source	Time
Wang et al. [1] : ABBYY	40.5	26.1	-	-	-	-	Un	ICCV	2011
Wang et al. [1] : SYNTH+PLEX	-	-	-	-	-	-	ExPr	ICCV	2011
Mishra et al. [2]	45.7	24.7	-	-	-	-	ExPu	BMVC	2012
Wang et al. [3]	40.2	32.4	-	-	-	-	ExPr	ICPR	2012
Goel et al. [4] : wDTW	-	-	-	-	-	-	Un	ICDAR	2013
Bissacco et al. [5] : PhotoOCR	-	-	-	-	-	-	ExPr	ICCV	2013
Phan et al. [6]	62.3	42.2	-	-	-	-	ExPu	ICCV	2013
Alsharif et al. [7] : HMM/Maxout	-	-	-	-	-	-	ExPu	ICLR	2014
Almazan et al. [8] : KCSR	-	-	-	-	-	-	ExPu	TPAMI	2014
Yao et al. [9] : Strokelets	-	-	-	-	-	-	ExPu	CVPR	2014
R.-Serrano et al.[10] : Label embedding	-	-	-	-	-	-	ExPu	IJCV	2015
Jaderberg et al. [11]	-	-	-	-	-	-	ExPu	ECCV	2014
Su and Lu [12]	-	-	-	-	-	-	ExPu	ACCV	2014
Gordo[13] : Mid-features	-	-	-	-	-	-	ExPu	CVPR	2015
Jaderberg et al. [14]	-	-	-	-	-	-	ExPr	IJCV	2015
Jaderberg et al. [15]	-	-	-	-	-	-	SK + ExPr	ICLR	2015
Shi, Bai, and Yao [16] : CRNN	-	-	-	-	-	-	SK	TPAMI	2017
Shi et al. [17] : RARE	91.2	77.4	71.8	59.2	-	-	SK	CVPR	2016
Lee and Osindero [18] : R2AM	-	-	-	-	-	-	SK	CVPR	2016
Liu et al. [19] : STAR-Net	94.3	83.6	73.5	-	-	-	SK + ExPr	BMVC	2016
*Yang et al. [20]	93.0	80.2	75.8	69.3	-	-	ExPu	IJCAI	2017
Yin et al. [21]	-	-	-	-	-	-	SK	ICCV	2017
*Cheng et al. [22] : FAN	-	-	-	-	*85.3	-	SK + ST (Pixel_wise)	ICCV	2017
Cheng et al. [23] : AON	94.0	83.7	73.0	76.8	68.2	-	SK + ST (D_A)	CVPR	2018
Gao et al. [24]	-	-	-	-	-	-	SK	arXiv	2017
Liu et al. [25] : Char-Net	-	-	73.5	-	60.0	-	SK (D_A)	AAAI	2018
*Liu et al. [26] : SqueezedText	-	-	-	-	-	-	ExPr	AAAI	2018
*Bai et al. [27] : EP	-	-	-	-	73.9	-	SK + ST (Pixel_wise)	CVPR	2018
Liu et al. [28]	-	-	73.9	62.5	-	-	SK	ECCV	2018
Gao et al. [29]	-	-	-	-	-	-	SK	ICIP	2018
Shi et al. [30] : ASTER	-	-	78.5	79.5	76.1	-	SK + ST	TPAMI	2018
Chen et al. [60] : ASTER + AEG	94.4	89.5	82.0	80.9	76.7	-	SK + ST	arXiv	2019
Luo et al. [46] : MORAN	94.3	86.7	76.1	77.4	68.8	-	SK + ST	PR	2019
Luo et al. [32] : MORAN-v2	-	-	79.7	81.9	73.9	-	SK + ST	PR	2019
Chen et al. [60] : MORAN-v2 + AEG	94.7	89.6	82.8	81.3	77.4	-	SK + ST	arXiv	2019
Xie et al. [47] : CAN	-	-	-	-	-	-	SK	ACM	2019
*Liao et al.[48] : CA-FCN	-	-	-	78.1	-	-	SK + ST+ ExPr	AAAI	2019
*Li et al. [49] : SAR	^95.8	^91.2	^86.4	^89.6	^78.8	^66.8	SK + ST + ExPr	AAAI	2019
Zhan et al. [55]: ESIR	-	-	79.6	83.3	76.9	-	SK + ST	CVPR	2019
Zhang et al. [56]: SSDAN	-	-	-	-	-	-	SK	CVPR	2019

2.2.3 Recognition Results on Bilingual Scene Text Dataset

In this section, we only list the top three results of each competition. Please refer to the competition website for more information.

Recognition Results on Bilingual Scene Text Dataset
Method	RCTW_17	MTWI	CTW	LSVT	ArT	ReCTS	Time	Source
Lv et al. : NLPR PAL	0.3201 (end-to-end)	-	-	-	-	-	2017	RCTW Competition
Jin et al. : SCUT_DLVC	0.2374 (end-to-end)	-	-	-	-	-	2017	RCTW Competition
Dai et al. : CCFLAB	0.2143 (end-to-end)	-	-	-	-	-	2017	RCTW Competition
IFLYTEK : nelslip(iflytek&ustc)	-	85.8 (AR)	-	-	-	-	2018	MTWI Competition
Samsung R&D China, Beijing : SRC-B-MachineLearningLab	-	85.7 (AR)	-	-	-	-	2018	MTWI Competition
NetEase : NTAI	-	82.6 (AR)	-	-	-	-	2018	MTWI Competition
Yuan et al.[42] : CTW	-	-	80.5 (AR)	-	-	-	2018	CTW
Liu et al. [43] : SCUT-CTW1500	-	-	-	-	-	-	2017	SCUT-CTW1500
Tencent-DPPR Team	-	-	-	66.66 (end-to-end)	-	-	2019	LSVT Competition
HUST VLRGROUP	-	-	-	63.42 (end-to-end)	-	-	2019	LSVT Competition
PMTD	-	-	-	63.36 (end-to-end)	-	-	2019	LSVT Competition
Clova AI OCR Team, NAVER/LINE Corp	-	-	-	-	85.32 (AR)	-	2019	ArT Competition
SenseTime Group	-	-	-	-	85.2 (AR)	-	2019	ArT Competition
USTC-iFLYTEK	-	-	-	-	81.23 (AR)	-	2019	ArT Competition
SCUT, The University of Adelaide, Northwestern Polytechnical University, Lenovo, HUAWEI	-	-	-	-	-	95.55 (AR)	2019	ReCTS Competition
Tencent(Data Platform Precision Recommendation)	-	-	-	-	-	94.86 (AR)	2019	ReCTS Competition
Huazhong University of Science and Technology	-	-	-	-	-	94.83 (AR)	2019	ReCTS Competition

3. Survey

[50] [TPAMI-2015] Ye Q, Doermann D. Text detection and recognition in imagery: A survey[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(7): 1480-1500. paper

[51] [Frontiers-Comput. Sci-2016] Zhu Y, Yao C, Bai X. Scene text detection and recognition: Recent advances and future trends[J]. Frontiers of Computer Science, 2016, 10(1): 19-36. paper

[52] [arXiv-2018] Long S, He X, Yao C. Scene Text Detection and Recognition: The Deep Learning Era[J]. arXiv preprint arXiv:1811.04256, 2018. paper

4. OCR Service

OCR	API	Free	Code
Tesseract OCR Engine	×	√	√
Azure	√	√	×
ABBYY	√	√	×
OCR Space	√	√	×
SODA PDF OCR	√	√	×
Free Online OCR	√	√	×
Online OCR	√	√	×
Super Tools	√	√	×
Online Chinese Recognition	√	√	×
Calamari OCR	×	√	√
Tencent OCR	√	×	×

5. References

[1] [ICCV-2011] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In Proceedings of International Conference on Computer Vision (ICCV), pages 1457–1464, 2011. paper

[2] [BMVC-2012] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In Proceedings of British Machine Vision Conference (BMVC), pages 1–11, 2012. paper dataset

[3] [ICPR-2012] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng. End-to-end text recognition with convolutional neural networks. In Proceedings of International Conference on Pattern Recognition (ICPR), pages 3304–3308, 2012. paper

[4] [ICDAR-2013] V. Goel, A. Mishra, K. Alahari, and C. Jawahar. Whole is greater than sum of parts: Recognizing scene text words. In Proceedings of International Conference on Document Analysis and Recognition (ICDAR), pages 398–402, 2013. paper

[5] [ICCV-2013] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. Photoocr: Reading text in uncontrolled conditions. In Proceedings of International Conference on Computer Vision (ICCV), pages 785–792, 2013. paper

[6] [ICCV-2013] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan. Recognizing text with perspective distortion in natural scenes. In Proceedings of International Conference on Computer Vision (ICCV), pages 569–576, 2013. paper

[7] [ICLR-2014] O. Alsharif and J. Pineau, End-to-end text recognition with hybrid HMM maxout models, in: Proceedings of International Conference on Learning Representations (ICLR), 2014. paper

[8] [TPAMI-2014] J. Almazán, A. Gordo, A. Fornés, and E. Valveny. Word spotting and recognition with embedded attributes. IEEE Trans. Pattern Anal. Mach. Intell., 36(12):2552–2566, 2014. paper code

[9] [CVPR-2014] C. Yao, X. Bai, B. Shi, and W. Liu. Strokelets: A learned multi-scale representation for scene text recognition. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 4042–4049, 2014. paper

[10] [IJCV-2015] J. A. Rodriguez-Serrano, A. Gordo, and F. Perronnin. Label embedding: A frugal baseline for text recognition. International Journal of Computer Vision (IJCV) , 113(3):193–207, 2015. paper

[11] [ECCV-2014] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In Proceedings of European Conference on Computer Vision (ECCV), pages 512–528, 2014. paper code

[12] [ACCV-2014] B. Su and S. Lu. Accurate scene text recognition based on recurrent neural network. In Proceedings of Asian Conference on Computer Vision (ACCV), pages 35–48, 2014. paper

[13] [CVPR-2015] A. Gordo. Supervised mid-level features for word image representation. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 2956–2964, 2015. paper

[14] [IJCV-2015] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. Int. J. Comput. Vision, 2015. paper code

[15] [ICLR-2015] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Deep structured output learning for unconstrained text recognition, in: Proceedings of International Conference on Learning Representations (ICLR), 2015. paper

[16] [TPAMI-2017] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell., 39(11):2298–2304, 2017. paper code-Torch7 code-Pytorch

[17] [CVPR-2016] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai. Robust scene text recognition with automatic rectification. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 4168–4176, 2016. paper

[18] [CVPR-2016] C.-Y. Lee and S. Osindero. Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 2231–2239, 2016. paper

[19] [BMVC-2016] W. Liu, C. Chen, K.-Y. K. Wong, Z. Su, and J. Han. STAR-Net: A spatial attention residue network for scene text recognition. In Proceedings of British Machine Vision Conference (BMVC), page 7, 2016. paper

[20] [IJCAI-2017] X. Yang, D. He, Z. Zhou, D. Kifer, and C. L. Giles. Learning to read irregular text with attention mechanisms. Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), 2017. paper

[21] [ICCV-2017] F. Yin, Y.-C. Wu, X.-Y. Zhang, and C.-L. Liu. Scene text recognition with sliding convolutional character models. In Proceedings of International Conference on Computer Vision (ICCV), 2017. paper code

[22] [ICCV-2017] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou. Focusing attention: Towards accurate text recognition in natural images. In Proceedings of International Conference on Computer Vision (ICCV), pages 5086–5094, 2017. paper

[23] [CVPR-2018] Cheng Z, Xu Y, Bai F, et al. AON: Towards Arbitrarily-Oriented Text Recognition. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 5571-5579, 2018. paper code

[24] [arXiv-2017] Gao Y, Chen Y, Wang J, et al. Reading Scene Text with Attention Convolutional Sequence Modeling[J]. arXiv preprint arXiv:1709.04303, 2017. paper

[25] [AAAI-2018] Liu W, Chen C, Wong K Y K. Char-Net: A Character-Aware Neural Network for Distorted Scene Text Recognition[C]//AAAI. 2018. paper

[26] [AAAI-2018] Liu Z, Li Y, Ren F, et al. SqueezedText: A Real-Time Scene Text Recognition by Binary Convolutional Encoder-Decoder Network[C]//AAAI. 2018. paper

[27] [CVPR-2018] Bai F, Cheng Z, Niu Y, Pu S, and Zhou S. Edit probability for scene text recognition. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 1508-1516, 2018. paper

[28] [ECCV-2018] Liu Y, Wang Z, Jin H, et al. Synthetically Supervised Feature Learning for Scene Text Recognition[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 435-451. paper

[29] [ICIP-2018] Gao Y, Chen Y, Wang J, et al. Dense Chained Attention Network for Scene Text Recognition[C]//2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018: 679-683. paper

[30] [TPAMI-2018] Shi B, Yang M, Wang X, et al. Aster: An attentional scene text recognizer with flexible rectification[J]. IEEE transactions on pattern analysis and machine intelligence, 2018. paper code

[31] [CVPR-2012] A. Mishra, K. Alahari, and C. V. Jawahar. Top-down and bottom-up cues for scene text recognition. In CVPR, 2012. paper

[32] https://github.com/Canjie-Luo/MORAN_v2

[33] [IJDAR-2005] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, K. Ashida, H. Nagai, M. Okamoto, H. Yamamoto, H. Miyao, J. Zhu, W. Ou, C. Wolf, J. Jolion, L. Todoran, M. Worring, and X. Lin. ICDAR 2003 robust reading competitions: entries, results, and future directions. IJDAR, 7(2-3):105–122, 2005. paper

[34] [ICDAR-2013] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. Almazán, and L. de las Heras. ICDAR 2013 robust reading competition. In ICDAR, 2013. paper

[35] [ICCV-2013] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, 2013. paper

[36] [Expert Syst.Appl-2014] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl., 41(18):8027–8048, 2014. paper

[37] [ICDAR-2015] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015. paper

[38] [arXiv-2016] Veit A, Matera T, Neumann L, et al. Coco-text: Dataset and benchmark for text detection and recognition in natural images[J]. arXiv preprint arXiv:1601.07140, 2016. paper code

[39] [ICDAR-2017] Ch'ng C K, Chan C S. Total-text: A comprehensive dataset for scene text detection and recognition[C]//Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on. IEEE, 2017, 1: 935-942. paper code

[40] [ICDAR-2017] Shi B, Yao C, Liao M, et al. ICDAR2017 competition on reading chinese text in the wild (RCTW-17)[C]//Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on. IEEE, 2017, 1: 1429-1434. paper

[41] [ICPR-2018] He M, Liu Y, Yang Z, et al. ICPR2018 Contest on Robust Reading for Multi-Type Web Images[C]//2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018: 7-12. paper

[42] [arXiv-2018] Yuan T L, Zhu Z, Xu K, et al. Chinese Text in the Wild[J]. arXiv preprint arXiv:1803.00085, 2018. paper code

[43] [arXiv-2017] Yuliang L, Lianwen J, Shuaitao Z, et al. Detecting curve text in the wild: New dataset and new solution[J]. arXiv preprint arXiv:1712.02170, 2017. paper code

[44] [ECCV-2018] Lyu P, Liao M, Yao C, et al. Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 71-88. paper code

[45] [NIPS-WORKSHOP-2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011. paper

[46] [PR-2019] C. Luo, L. Jin, and Z. Sun, “MORAN: A multi-object rectified attention network for scene text recognition,” Pattern Recognition, vol. 90, pp. 109–118, 2019. paper code

[47] [ACM-2019] Xie H, Fang S, Zha Z J, et al, “Convolutional Attention Networks for Scene Text Recognition,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 15, pp. 3 2019. paper

[48] [AAAI-2019] Liao M, Zhang J, Wan Z, et al, “Scene text recognition from two-dimensional perspective,” //AAAI. 2019. paper

[49] [AAAI-2019] Li H, Wang P, Shen C, et al, “Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition,” //AAAI. 2019. paper code

[50] [TPAMI-2015] Ye Q, Doermann D. Text detection and recognition in imagery: A survey[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(7): 1480-1500. paper

[51] [Frontiers-Comput. Sci-2016] Zhu Y, Yao C, Bai X. Scene text detection and recognition: Recent advances and future trends[J]. Frontiers of Computer Science, 2016, 10(1): 19-36. paper

[52] [arXiv-2018] Long S, He X, Yao C. Scene Text Detection and Recognition: The Deep Learning Era[J]. arXiv preprint arXiv:1811.04256, 2018. paper

[53] [NIPS-WORKSHOP-2014] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Synthetic data and artificial neural networks for natural scene text recognition, in: Proceedings of Advances in Neural Information Processing Deep Learn. Workshop (NIPS-W).2014. paper code

[54] [CVPR-2016] A. Gupta, A. Vedaldi, A. Zisserman, Synthetic data for text localisation in natural images, in: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2315–2324. paper code

[55] [CVPR-2019] Zhan F, Lu S. Esir: End-to-end scene text recognition via iterative image rectification, in: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2059-2068. paper

[56] [CVPR-2019] Zhang Y, Nie S, Liu W, et al. Sequence-To-Sequence Domain Adaptation Network for Robust Text Image Recognition, in: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2740-2749. paper code

[57] ICDAR2019 Robust Reading Challenge on Large-scale Street View Text with Partial Labeling. Link

[58] ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text. Link

[59] ICDAR 2019 Robust Reading Challenge on Reading Chinese Text on Signboard. Link

[60] [arXiv-2019] X. Chen, T. Wang, Y. Zhu, L. Jin, and C. Luo. Adaptive Embedding Gate for Attention-Based Scene Text Recognition[J]. arXiv preprint arXiv:1908.09475, 2019. paper

6. Help

If you find any problems in our resources, or any good papers/codes we have missed, please inform us at xxuechen@foxmail.com. Thank you for your contribution.
