A multi-label language identification dataset based on regional Indian languages. It covers 5 languages (Hindi, Bengali, Malayalam, Kannada, and English), and each image contains text in two scripts, which makes every image multilingual. The dataset is diverse: in addition to horizontal text, it includes curved, perspective-distorted, and multi-oriented text. This diversity is achieved by applying image transformations such as affine transformations, arcs, and perspective distortion at different angles. The images come from multiple sources: mobile camera captures, existing datasets, and the web.
Fig. 1: Samples from IIITG-MLRIT2022
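Because each image carries text in two of the five languages, its target can be represented as a multi-hot vector rather than a single class index. The following is a minimal sketch of that idea, not the dataset's actual annotation format: the label order in `LANGUAGES` and the helper names `encode_labels` and `decode_labels` are assumptions made for illustration.

```python
# Assumed label order; the dataset's own annotation files may use a different one.
LANGUAGES = ("Hindi", "Bengali", "Malayalam", "Kannada", "English")


def encode_labels(present):
    """Turn the set of languages found in an image into a multi-hot vector."""
    present = set(present)
    unknown = present - set(LANGUAGES)
    if unknown:
        raise ValueError(f"Unknown language(s): {sorted(unknown)}")
    return [1 if lang in present else 0 for lang in LANGUAGES]


def decode_labels(vector):
    """Turn a multi-hot vector back into a list of language names."""
    if len(vector) != len(LANGUAGES):
        raise ValueError("Vector length must match the number of languages")
    return [lang for lang, flag in zip(LANGUAGES, vector) if flag]


# Hypothetical image containing Hindi and English text.
target = encode_labels({"Hindi", "English"})
print(target)
print(decode_labels(target))
```

A vector of this form pairs naturally with a per-language sigmoid output and a binary cross-entropy loss, which is a common choice for multi-label classification.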
@article{naosekpam2023multi,
title={Multi-label Indian scene text language identification},
author={Naosekpam, Veronica and Sahu, Nilkanta},
journal={Intelligent Systems and Applications in Computer Vision},
year={2023},
publisher={CRC Press}
}
or
Naosekpam, Veronica, et al. "EMBiL: An English-Manipuri Bi-lingual Benchmark for Scene Text Detection and Language Identification." International Conference on Computer Analysis of Images and Patterns. Cham: Springer Nature Switzerland, 2023.