We construct a new benchmark for Hangul OCR, a task with an intractably large number of classes. The proposed benchmark reveals the class-imbalance and target-class selection issues in Hangul OCR.
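To give a sense of the class count: the Unicode Hangul Syllables block (U+AC00 to U+D7A3) holds 19 × 21 × 28 = 11,172 precomposed syllables, each built from an initial consonant, a medial vowel, and an optional final consonant. The sketch below shows the standard Unicode index arithmetic for splitting a syllable into those three parts. It only illustrates the general idea of character decomposition and is not necessarily the scheme used in our paper; the function names `decompose` and `compose` are hypothetical.

```python
from typing import Tuple

# Unicode Hangul Syllables block starts at U+AC00.
HANGUL_BASE = 0xAC00
NUM_INITIAL = 19  # initial consonants
NUM_MEDIAL = 21   # medial vowels
NUM_FINAL = 28    # final consonants, index 0 meaning "no final consonant"
NUM_SYLLABLES = NUM_INITIAL * NUM_MEDIAL * NUM_FINAL  # 11,172


def decompose(syllable: str) -> Tuple[int, int, int]:
    """Return (initial, medial, final) indices of one precomposed syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_SYLLABLES:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(offset, NUM_MEDIAL * NUM_FINAL)
    medial, final = divmod(rest, NUM_FINAL)
    return initial, medial, final


def compose(initial: int, medial: int, final: int) -> str:
    """Inverse of decompose: rebuild the syllable from its three indices."""
    return chr(HANGUL_BASE + (initial * NUM_MEDIAL + medial) * NUM_FINAL + final)


if __name__ == "__main__":
    print(NUM_SYLLABLES)             # size of the syllable-level label space
    print(decompose("한"))            # indices for the syllable "한"
    print(compose(*decompose("한")))  # round trip back to the syllable
```

Under this view, a recognizer could predict three small label sets (19 + 21 + 28 = 68 symbols in total) instead of one label per syllable, which is why decomposition is attractive when many syllables are rare or absent from the training data.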
Name | Link | Description |
---|---|---|
AI Hub train-set | Website* | It consists of about 100,000 images of Hangul characters. A total of 674,110 text areas are extracted to evaluate the performance of character recognition. Of these, 10,000 are separated into the test set, and the rest are used as training data. |
AI Hub test-set | Google Drive | - |
MLT-h test-set | Google Drive | The MLT dataset was introduced at ICDAR to address multi-lingual text detection and script identification. We use only the Hangul text regions of the MLT17 test-set for evaluation and name this subset MLT-h. We found many annotation errors in this dataset and corrected the noisy labels. |
SFW test-set | Google Drive | To emphasize the class-imbalance problem in Korean character recognition, we synthesized a new dataset containing a large number of minority classes using SynthTiger. The dataset contains a total of 18,831 standard foreign words registered with the National Institute of the Korean Language. |
Unseen Characters test-set | Google Drive | To evaluate robustness to unseen characters, we selected 72 characters in SFW that could not be represented with a common character encoding, and generated one image per character. |
*The AI Hub train-set must be downloaded from the official website. We cropped text regions for training; this cropped dataset will be available soon.
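The class imbalance and unseen-character issues that these test sets target can be checked on any pair of label lists. Below is a minimal sketch, assuming the transcriptions are already loaded as plain Python strings. The annotation format of the datasets is not described here, so the sample lists and the helper names `char_frequencies` and `unseen_characters` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def char_frequencies(labels: Iterable[str]) -> Counter:
    """Count how often each non-whitespace character occurs in the labels."""
    counts = Counter()
    for text in labels:
        counts.update(ch for ch in text if not ch.isspace())
    return counts


def unseen_characters(train_labels: Iterable[str],
                      test_labels: Iterable[str]) -> List[str]:
    """Return test-set characters that never occur in the training labels."""
    seen = set(char_frequencies(train_labels))
    return sorted(set(char_frequencies(test_labels)) - seen)


if __name__ == "__main__":
    # Hypothetical toy transcriptions, not taken from the benchmark.
    train = ["한글", "한국", "글자"]
    test = ["한글", "똠"]

    freq = char_frequencies(train)
    print(freq.most_common())              # frequent vs. rare classes
    print(unseen_characters(train, test))  # characters absent from training
```

A long tail in the frequency counts indicates minority classes, and a non-empty unseen list means a syllable-level classifier trained on that data has no output class for those characters.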
Our paper has been accepted to the ECCV 2022 TiE workshop.
@article{kim2022character,
title={Character decomposition to resolve class imbalance problem in Hangul OCR},
author={Kim, Geonuk and Son, Jaemin and Lee, Kanghyu and Min, Jaesik},
journal={arXiv preprint arXiv:2208.06079},
year={2022}
}