
A collection of OCR-related datasets

OCR Datasets

This repo collects OCR-related datasets. The datasets are grouped into six types: Natural Scene Text, Document Text, Handwritten Text, Historical Document Text, Video Text, and Synthetic Text.

OCR Dataset Type

  • Natural Scene Text: images in this type of dataset are usually taken in natural scenes, so the difficulty lies in complex lighting variations, shooting angles, blur, varied fonts, etc.
  • Document Text: focuses only on document images; the main difficulty is the variety of typesetting.
  • Historical Document Text: is usually designed to assist social science research. For example, digitized antiquarian documents help preserve historical materials and make related research easier for scholars.
  • Video Text: aims at recognizing texts in videos, which introduces temporal information into the OCR task.
  • Synthetic Text: generates images containing text, together with the corresponding annotations, by rendering text in different fonts onto natural photos. This type of dataset usually includes hundreds of thousands of samples since it does not require human annotation. However, because current rendering techniques are limited, there is usually a large domain gap between synthetic images and authentic samples, so these datasets are often used for pre-training only.
Natural Scene Text
Year/Venue Name Task #Train(#wds) #Val(#wds) #Test(#wds) Granu. Anno. Form Language Scene Paper Size
2003-05/ICDAR IC03/IC05 Det. & Rec. 258 (1110) N/A 251 (1156) Word Rect [x, y, w, h, "transcript"] English Natural PDF 112MB
2011-15/ICDAR Born-Digital-Image (IC2011-2015) Det. & Rec. & Seg. 410 (3564) N/A 141 (1439) Word & Pixel Rect [x, y, w, h, "transcript"] English Natural/Web/Email PDF 40MB
2013-15/ICDAR Focused Scene Text (IC13) Det. & Rec. & Seg. 229 (848) N/A 233 (1095) Word & Pixel Rect [x1, y1, x2, y2, "transcript"] & SegMap English Natural PDF 250MB
2015/ICDAR Incidental Scene Text (IC15) Det. & Rec. 1,000 (4468) N/A 500 (2077) Word Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] English Natural PDF 130MB
2017/ICDAR Multi-Lingual Scene Text (MLT2017) Det. & Rec. 7,200 1,800 private Word Quad [x1, y1, x2, y2, x3, y3, x4, y4, Lan, 'trans'] multi-lingual Natural - 12GB
2019/ICDAR Multi-Lingual Scene Text (MLT2019) Det. & Rec. 10,000 N/A 10,000 Word Quad [x1, y1, x2, y2, x3, y3, x4, y4, Lan, 'trans'] multi-lingual Natural PDF ~12GB
2017/ICDAR COCO-Text v2.0 Det. & Rec. 43,686 10,000 10,000 Word Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] En & NonEn Natural PDF 13GB
2019/ICDAR ReCTS Det. & Rec. 20,000 N/A 5,000 Word/Line Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] Chinese Signboard - ~2.5GB
2017/ICDAR Total-Text Det. & Rec. 1255 N/A 300 Word & Pixel Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] English Natural PDF 441MB
2019/PR SCUT-CTW1500 Det. & Rec. 1,000 N/A 500 Line Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] En & Ch Natural PDF 800MB
2019/ICDAR Arbitrary-Shaped Text (ART) Det. & Rec. 5,603 (50,029) N/A 4,563 (52,631) Word(En)/Line(CH) Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], Lan, 'trans'] En & Ch Natural - 4.4GB
2017/ICDAR RCTW-17 (CTW-12k) Det. & Rec. 11514 N/A 1000 Line Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] Chinese Mixture PDF 11GB
2019/ICDAR/ICCV Large-scale Street View Text (LSVT) Det. & Rec. 30,000 N/A 20,000 Line Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] En & Ch Street View PDF 14GB
2016/DAS MLe2e Det. & Script Identifica. 450 N/A 261 Word Rect [x1, y1, x2, y2, language] multi-lingual Natural PDF 82MB
2017/ICDAR IIIT-ILST Det. & Rec. 893 Word Rect [x, y, w, h, "transcript"] Indic Google Images PDF 609MB
2017/CVPRW UberText Det. & Rec. 117,969 (571,534) Word Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] English Street View PDF 197GB
2009/VISAPP Chars74k Det. & Rec. 1922 Character En & Kannada Natural Scene PDF 739MB
2010/ICPR KAIST Det. & Rec. & Seg. 3000 Char & Word & Pixel Rect [x, y, w, h, "transcript"] & SegMap En & Korean Mixture PDF 364MB
2010/ECCV SVT Det. & Rec. 100 (211) N/A 250 (514) Word Rect [x, y, w, h, "transcript"] English Street View PDF 118MB
2013/ICCV SVTP (download code:vnis) Rec. 238 (639) - English Street View PDF ~1MB
2011/NIPSw SVHN Det. & Rec. 73,257+531,131 N/A 26,032 Character Rect [x, y, w, h, "transcript"] Digit House Number PDF ~3GB
2011/ICDARw NEOCR Det. 659 (5,238) Line Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] multi-lingual Natural Scene PDF 1.3GB
2012/CVPR MSRA-TD500 Det. 300 N/A 200 Line RotRect [ind, difficult, x, y, w, h, theta] multi-lingual Street View PDF 96MB
2012/BMVC IIIT 5k-word Rec. 380 (2000) N/A 740 (3000) Word English Natural PDF 106MB
2014/ESWA CUTE80 Rec. 80 Line Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]]] English Street View PDF 44MB
2015/TPAMI USTB-SV1K Det. & Rec. 500 N/A 500 Word RotRect [ind, difficult, x, y, w, h, theta, "trans"] English Street View PDF 36MB
2019/JCST Chinese Text in the Wild (CTW) Det. & Rec. 25,887(812,872chrs) N/A 3,269(103,519chrs) Char & Word Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] Chinese Street View PDF ~40GB
2019/TITS ShopSign Det. & Rec. 1258 sample images Word Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] Chinese Signboard PDF 3GB
2021/CVPR TextOCR Det. & Rec. & VQA 24902 (822,572) N/A 3232 (80,497) Word Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] English Natural Scene PDF ~8GB
2021/CVPR VinText Det. & Rec. 1,200 N/A 300+500 Word Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] Vietnamese Natural Scene PDF 1GB
2018/Competition ICPR MTWI2018 Det. & Rec. 10,000 N/A 10,000 Word Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] En & Ch WEB Images PDF 2GB
2019/Competition Baidu Chinese Scene Text Recognition Competition Rec. 50,000 N/A 10,000 - [h, w, 'trans'] En & Ch Street View -
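The Anno. Form column above mixes several geometries (Rect, RotRect, Quad, Polygon), so a common first step when combining datasets is to convert every box into a list of corner points. The following is a minimal Python sketch of that conversion, not code shipped with any of the listed datasets. All function names are hypothetical, and the RotRect convention used here (x, y is the top-left corner of the unrotated box, which is then rotated by theta radians about its center) is an assumption that should be checked against each dataset's own documentation.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def rect_xywh_to_polygon(x: float, y: float, w: float, h: float) -> List[Point]:
    """Rect given as [x, y, w, h] -> four corners, starting at the top-left."""
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]


def rect_xyxy_to_polygon(x1: float, y1: float, x2: float, y2: float) -> List[Point]:
    """Rect given as [x1, y1, x2, y2] (two opposite corners) -> four corners."""
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]


def quad_to_polygon(coords: Sequence[float]) -> List[Point]:
    """Quad given as a flat list [x1, y1, ..., x4, y4] -> four (x, y) pairs."""
    if len(coords) != 8:
        raise ValueError(f"expected 8 coordinates, got {len(coords)}")
    return [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]


def rotrect_to_polygon(x: float, y: float, w: float, h: float, theta: float) -> List[Point]:
    """RotRect [x, y, w, h, theta] -> four corners.

    Assumed convention (verify per dataset): (x, y) is the top-left corner of
    the unrotated box, and theta is a rotation in radians about the box center.
    """
    cx, cy = x + w / 2.0, y + h / 2.0
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = []
    for px, py in rect_xywh_to_polygon(x, y, w, h):
        dx, dy = px - cx, py - cy
        rotated.append((cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t))
    return rotated
```

Once every annotation is a point list, Rect, RotRect, Quad, and Polygon labels can share the same downstream code for cropping, drawing, or evaluation.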
Document Text
Year/Venue Name Task #Train #Val #Test Granu. Anno. Form Language Scene Paper Size
2011/ICDAR RETAS No public download link   Char & Word No public download link -
2013/IJDAR LRDE-DBD Document Binarization Det. & Binarization 125 Line & Mask Rect French Magazine PDF ~700MB
2015/ICDAR SmartDOC 3630 N/A 8470 PDF ~30GB
2016/ICFHR KPTI Rec. 11,910 2,552 2,553 - ['transcripts'] Pashto Document PDF ~100MB
2017/ICDAR DeText Det. & Rec. 100 100 300 Word Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] English Scientific
2019/ICDAR SROIE Det. & Rec. & Info Ext. 600 400 Word Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] English Receipt - <1GB
2019/ICDAR FUNSD Det. & Rec. & Info Ext. 149 N/A 50 Word Rect [x1, y1, x2, y2, "transcript"] English Form PDF 16MB
2019/ICDAR NAF Det. & Rec. & Info Ext. 682 59 63 Line Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] English Form PDF
2020 BID Det. & Rec. 28880 Line Poly Latin ID Document
2020/ISCSIC DDI-100 Det. & Rec. ~ 100,000 (70% train, 30% val) Char & Word & Mask Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] English Distorted Document PDF ~300GB
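Several rows above list the Quad form [x1, y1, x2, y2, x3, y3, x4, y4, 'trans']. As a rough illustration of reading such labels, here is a minimal Python sketch that assumes each annotation is stored as one comma-separated text line with the transcript as the last field. That on-disk layout is an assumption, not something this list specifies, and it varies between datasets, so check each dataset's readme before relying on it. `QuadAnnotation` and `parse_quad_line` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class QuadAnnotation(NamedTuple):
    points: List[Tuple[int, int]]  # four (x, y) corners in file order
    transcript: str                # text content of the region


def parse_quad_line(line: str) -> QuadAnnotation:
    """Parse one line assumed to look like 'x1,y1,x2,y2,x3,y3,x4,y4,transcript'."""
    # Drop a possible UTF-8 byte-order mark and the trailing newline.
    cleaned = line.lstrip("\ufeff").rstrip("\r\n")
    # Split at most 8 times so commas inside the transcript are preserved.
    fields = cleaned.split(",", 8)
    if len(fields) != 9:
        raise ValueError(f"expected 8 coordinates and a transcript: {line!r}")
    # Go through float() first so values such as '12.0' are accepted.
    coords = [int(float(value)) for value in fields[:8]]
    points = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    return QuadAnnotation(points=points, transcript=fields[8])


def parse_quad_file(text: str) -> List[QuadAnnotation]:
    """Parse the full contents of an annotation file, skipping blank lines."""
    return [parse_quad_line(line) for line in text.splitlines() if line.strip()]
```

Limiting the split to eight separators is the main design choice here: a transcript such as "Hello, world" stays intact instead of being cut at its comma.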
Handwritten Text
Year/Venue Name Task #Train #Val #Test Granu. Anno. Form Language Scene Paper Size
2008-11/ICDAR RIMES No public download link Word & Line No public download link
2010/DAS HIT-OR3C Rec. Char set 832,650 chars / Doc set 77,168 chars - special format Chinese Handwritten PDF 1GB
2012/PR KHATT Rec. 8,368 1,793 1,822 - ['transcripts'] Arabic Handwritten PDF
98-2014 HANDS No public download link Japanese Handwritten
- Lao-SABAIDEE 500 samples No public download link   Lao Handwritten
2014/ICFHR ORAND-CAR/CVL Rec. 5,000 N/A 5,000 Word ['image_name', 'trans'] Digits Handwritten Digits PDF 194MB
2018/ICFHR VNOnDB Rec. 1,146 paragraphs / 7,296 lines / 380,000 chars Word/Line/Parag. Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] Vietnamese Handwritten PDF 200MB
2013-16/IJDAR PE92/SERI95/HanDB (HangulDB) Rec. 1200 samples (90% Train/10% Test) .HGU1 format Korean Handwritten PDF 800MB
95-2016 NIST Rec. English
2011/ICDAR CASIA-OLHWDB/HWDB Rec. Chinese Handwritten PDF
2021/ICDAR IIT-INDIC-HW-WORDS Rec. 872,000 instances Word ['image_name', 'vocab_id'] & vocabulary Indic Handwritten PDF ~20GB
1999/ICDAR IAM Handwriting Database Rec. 6,161 900+940 1,861 Registration is Required
2005/ICDAR IAM ONLINE Handwriting Data Rec. 86,272 word instances Registration is Required
2018/ICDAR IAM-MonDo Rec. Registration is Required    PDF
2011-14/ICDAR CROHME Rec. > 10,000 expressions symbol & expression inkml format, latex Symbol Mathematical PDF 58MB
2017/ICDAR MUSCIMA++ Rec. 140 Symbol Music Notation PDF
2018/Access SCUT-EPT Rec. 40,000 N/A 10,000 Chinese Educational Doc. PDF 1.08GB
2020/ICFHR HHD Rec. 3965 1134 Hebrew PDF
2021/ArXiv IMGUR5K Det. & Rec. (~108,000) (~13,000) (~14,000) Word Rect [x, y, w, h, "transcript"] English Handwritten PDF -
2021/ArXiv VML-MOC Seg. & Rec. Hebrew PDF
2021/ICDAR Bengali Rec. Bengali PDF
2021/ICDAR GNHK Det. & Rec. 687 Word Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] English PDF
Historical Document Text
Year/Venue Name Task #Train #Val #Test Granu. Anno. Form Language Scene Paper Size
2010-11/DAS IAM-HistDB Rec. 127 Word & Line ['image_id', 'transcript'] En & Ger & Latin >200MB
2016/ICFHR H-KWS (1. Botany 2. AK) Det. & Rec. 1849 3734 N/A Word & Line Rect [x, y, w, h, "transcript"] English PDF
2016/ICFHR READ Registration is Required German PDF ~600MB
2017/ICFHR Palm Leaf Manuscript Det. & Rec. ~19,000 Balinese + ~20,000 Khmer Char No public download link Khmer Palm Leaf
2017/HIP SleukRith-Set Det. & Rec. 658 Char & Word Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'transcript'] Khmer Palm Leaf PDF 1GB
2019/NCA ARDIS Rec. 10,000 Char & Word ['transcript'] Digits Church Records PDF
2019/ICDAR Pinkas Det. & Rec. Word & Line Hebrew historical manuscripts PDF ~50MB
2020/ICFHR Cuneiform PDF
2020/ICFHR MTHv2 Det. & Rec. 2,399 N/A 800 Char & Line Chinese Ancient Book PDF 4.6GB
2021/ICDAR IHR-NomDB Det. & Rec. 267 Line Rect [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] ChuNom Ancient Book PDF
2021/ICDAR VML-HP Hebrew PDF
2019/ICDAR IndiScapes Seg No public download link Indic PDF
Video Text
Year/Venue Name Task #TrainVids (#frames) #ValVids (#f) #TestVids(#f) Granu. Anno. Form Language Scene Paper Size
2013-15/ICDAR Text in Videos (IC13) Det. & Rec. 25 (13450) 24 (14374) Word Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] English Natural PDF
2015/ICDAR CVSI2015 No public download link multi-lingual PDF
2017/ICDAR DOST Word Quad Japanese
2018/ICFHR LectureVideoDB Det. & Rec. -52,225 -27,900 -36,460 Word English Slides/Paper PDF 2.3GB
2020/ICRA RoadText-1K Det. & Rec. 500 (150,000) 200 (60,000) 300 (90,000) Line Rect [x1, y1, x2, y2, "transcript"] & SegMap En & NonEn Road/Traffic PDF
2020/ICMV MIDV-500 & MIDV-2019 Det. & Rec. & Others 500 video clips Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] multi-lingual Document PDF 32GB
2021/ICDAR MIDV-LAIT Det. & Rec. & Others Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] multi-lingual Document PDF
2020/ICPR AcTiVComp Det. & Rec. 2557 frames Line Rect [x1, y1, x2, y2, "transcript"] Arabic
Synthetic Text
Year/Venue Name Task #Train #Val #Test Granu. Anno. Form Language Scene Paper Size
2016/CVPR Synth800k Det. & Rec. 858,750 (7,266,866) Char & Word & Line Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] English Synthetic PDF 41GB
2020 UnrealText 728,000 En + 674,000 others multi-lingual
- Chinese_ocr Det. & Rec. ~ 364 million Chinese Document
- UPTI Urdu
- APTI 45,313,600 (> 250 million chars) Word Arabic
2021/ICDAR SynthTiger Rec. PDF
2021/ICDAR DocSynth PDF