Benchmark data for Hangul OCR

We construct a new benchmark for Hangul OCR, a task with an intractably large number of classes. The proposed benchmark exposes the class-imbalance and target-class selection issues in Hangul OCR.
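To make the size of the label space concrete: Unicode defines 11,172 precomposed Hangul syllables (U+AC00 to U+D7A3), each built from one of 19 initial consonants, 21 vowels, and 28 finals (counting "no final"). The sketch below uses only this standard Unicode arithmetic to split a syllable into its three component indices. It is an illustration, not the implementation from the paper, and the function name `decompose` is hypothetical.

```python
# Standard Unicode Hangul syllable arithmetic (not taken from the paper's code).
S_BASE = 0xAC00                          # code point of the first syllable, '가'
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # initials, vowels, finals (incl. "none")
S_COUNT = L_COUNT * V_COUNT * T_COUNT    # total number of precomposed syllables


def decompose(ch):
    """Return (initial, medial, final) indices for one Hangul syllable.

    A final index of 0 means the syllable has no final consonant.
    """
    index = ord(ch) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError(f"{ch!r} is not a precomposed Hangul syllable")
    initial = index // (V_COUNT * T_COUNT)
    medial = (index % (V_COUNT * T_COUNT)) // T_COUNT
    final = index % T_COUNT
    return initial, medial, final


if __name__ == "__main__":
    print(S_COUNT)
    print(decompose("한"))
```

Predicting the three components separately needs 19 + 21 + 28 = 68 outputs rather than one output per syllable, which is one way to see why decomposition is attractive when many syllable classes are rare.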

Download Link

| Name | Link | Description |
| --- | --- | --- |
| AI Hub train-set | Website* | It consists of about 100,000 images of Hangul characters. A total of 674,110 text areas are extracted to evaluate the performance of character recognition. Of these, 10,000 are held out as the test set, and the rest are used as training data. |
| AI Hub test-set | Google Drive | - |
| MLT-h test-set | Google Drive | The MLT dataset was introduced at ICDAR to address multi-lingual text detection and script identification. We use only the Hangul text regions in the MLT17 test-set for evaluation and call this subset MLT-h. We found many annotation errors in this dataset and corrected the noisy labels. |
| SFW test-set | Google Drive | To emphasize the class-imbalance problem in Korean character recognition, we synthesized a new dataset containing a large number of minority classes using SynthTiger. The dataset contains a total of 18,831 standard foreign words that are registered with the National Institute of the Korean Language. |
| Unseen Characters test-set | Google Drive | To evaluate robustness on unseen characters, we selected 72 characters in SFW that cannot be represented with a common character encoding, and generated one image per character. |

*The AI Hub train-set must be downloaded from the official website. We cropped text regions for training; the cropped dataset will be made available soon.
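The AI Hub split described in the table (674,110 text areas, 10,000 of them held out for testing) can be sketched as follows. This assumes a seeded random hold-out over region indices; the source does not state how the 10,000 test regions were actually chosen, and the function name and seed are hypothetical.

```python
import random

TOTAL_REGIONS = 674_110   # text areas extracted from AI Hub (from the table)
TEST_SIZE = 10_000        # regions held out for testing (from the table)


def split_indices(total=TOTAL_REGIONS, test_size=TEST_SIZE, seed=0):
    """Return (train_indices, test_indices) for a random hold-out split.

    Assumption: a uniform random selection; the real selection rule is unknown.
    """
    rng = random.Random(seed)                        # seeded for reproducibility
    test = set(rng.sample(range(total), test_size))  # unique test indices
    train = [i for i in range(total) if i not in test]
    return train, sorted(test)


if __name__ == "__main__":
    train_idx, test_idx = split_indices()
    print(len(train_idx), len(test_idx))
```

Under these numbers the training portion contains 664,110 regions.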


Sample images

* AI Hub

* MLT-h

* SFW

* Unseen Characters


Citation

Our paper was accepted to the ECCV 2022 TiE workshop.

@article{kim2022character,
  title={Character decomposition to resolve class imbalance problem in Hangul OCR},
  author={Kim, Geonuk and Son, Jaemin and Lee, Kanghyu and Min, Jaesik},
  journal={arXiv preprint arXiv:2208.06079},
  year={2022}
}