Single Grapheme Prediction based model, Bangla HOCR

Primary language: Python. License: Creative Commons Zero v1.0 Universal (CC0-1.0)

bnhocr

Optical Character Recognition for Bangla handwritten and printed documents

   version: 0.0.1
   authors: MD. Nazmuddoha Ansary (team ovijatrik, apsis solutions ltd, bengali.ai)
            MD. Mobassir Hossain  (team ovijatrik, apsis solutions ltd, bengali.ai)
            MD. Aminul Islam      (team ovijatrik)

This project was created in association with:

Environment

  • For Ubuntu, install the Tesseract Bangla language data: install the tesseract-ocr-ben package
  • For Windows (untested source):
    • Download and install tesseract-ocr-w64-setup-v5.0.0-rc1.20211030.exe (or the latest one).
    • Open https://github.com/tesseract-ocr/tessdata and download your language. For example, for Bangla download ben.traineddata.
    • Copy the downloaded file to the tessdata folder of the Tesseract-OCR installation, typically something like: C:\Program Files\Tesseract-OCR\tessdata
    • Use the traineddata file name (without the extension) as the language code. For Bangla, that is lang='ben'.
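
The language setup above can be checked from Python before running the recognizer. The following is a minimal sketch, not part of the repository: the helper names has_traineddata and recognize are hypothetical, and it assumes the pytesseract and Pillow packages are installed (the README does not confirm which wrapper the project uses).

```python
from pathlib import Path


def has_traineddata(tessdata_dir, lang="ben"):
    """Return True if <lang>.traineddata exists in the given tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


def recognize(image_path, lang="ben"):
    """Run Tesseract on an image file and return the recognized text."""
    # Imported lazily so the file check above works without these packages.
    # Assumption: pytesseract and Pillow are the wrappers in use.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


if __name__ == "__main__":
    # Path taken from the Windows instructions; adjust it for your install.
    tessdata = r"C:\Program Files\Tesseract-OCR\tessdata"
    print("Bangla data installed:", has_traineddata(tessdata))
```

If has_traineddata returns False, Tesseract will most likely fail to load the language when lang='ben' is requested, so it is worth checking the folder first.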

python requirements

  • pip requirements: pip install -r requirements.txt

It's better to use a virtual environment, because some of the pip requirements may not work properly when they conflict with locally installed modules. Alternatively, use conda:

  • Preferred way (conda): create the environment from environment.yml: conda env create -f environment.yml

model requirements

  • Download the model.h5 file
  • Place the model.h5 file under the models folder

LOCAL ENVIRONMENT/TESTING ENVIRONMENT

OS          : Ubuntu 20.04.3 LTS
Memory      : 23.4 GiB
Processor   : Intel® Core i5-8250U CPU @ 1.60GHz × 8
Graphics    : Intel® UHD Graphics 620 (Kabylake GT2)
Gnome       : 3.36.8

About bnhocr: Printed and Handwritten text recognition

Existing models such as Tesseract and EasyOCR already cover printed Bangla text with considerable accuracy. In this project we focus solely on handwritten text.

  • The idea of the bnhocr project is to convert handwritten graphemes into a unique representational space (in our example, a font-faced image).
  • Converting single graphemes does not cover words. Separated graphemes can be used if a grapheme localization model is available. Watch this video for the basic idea.

  • To cover words, we build on the single-grapheme transformation model and extend it to handwritten words.
  • Then we use an existing recognizer (in our example, Tesseract) on the transformed image.

  • A short presentation about this work is available at resources/slides.pdf
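
The two-stage idea described above (transform handwriting into a font-faced image, then pass it to an existing recognizer) can be sketched as a small function that takes both stages as parameters. This is an illustrative outline only: recognize_handwriting, to_font_face and recognizer are hypothetical names, not the repository's API, and the stand-in stages below operate on strings instead of real images.

```python
from typing import Callable, TypeVar

Image = TypeVar("Image")


def recognize_handwriting(
    handwritten: Image,
    to_font_face: Callable[[Image], Image],
    recognizer: Callable[[Image], str],
) -> str:
    """Two-stage pipeline: normalize handwriting, then run a printed-text OCR."""
    # Stage 1: map the handwritten image into the font-faced representation.
    font_faced = to_font_face(handwritten)
    # Stage 2: hand the normalized image to an existing recognizer.
    return recognizer(font_faced).strip()


if __name__ == "__main__":
    # Stand-in stages; a real run would plug in the trained model and an OCR engine.
    fake_model = lambda img: img.upper()
    fake_ocr = lambda img: f" {img} \n"
    print(recognize_handwriting("word", fake_model, fake_ocr))
```

Keeping the recognizer as a parameter means a different engine could be swapped in for Tesseract without touching the transformation stage.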

Demo

  • clone the repo
  • install dependencies
  • streamlit run app.py

Graphemes

@inproceedings{alam2021large,
  title={A large multi-target dataset of common bengali handwritten graphemes},
  author={Alam, Samiul and Reasat, Tahsin and Sushmit, Asif Shahriyar and Siddique, Sadi Mohammad and Rahman, Fuad and Hasan, Mahady and Humayun, Ahmed Imtiaz},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={383--398},
  year={2021},
  organization={Springer}
}

Known Issues

  • The model is not cached while running in Streamlit, so it is loaded again on each rerun
  • Only Tesseract is supported as the recognizer so far
    • EasyOCR and detection models can easily be added for wider applications
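
One possible way to address the caching issue is to load the model through a memoized function, so repeated calls in the same process reuse one object. The sketch below uses functools.lru_cache from the standard library; get_model and _load_from_disk are hypothetical names, and the loader body is a placeholder (assumption: the real app would load a Keras model from models/model.h5). Recent Streamlit releases also provide their own decorator, st.cache_resource, which serves the same purpose inside an app.

```python
from functools import lru_cache


def _load_from_disk(path):
    """Placeholder loader; a real app would deserialize the model here."""
    # Assumption: with TensorFlow this would be something like
    # tensorflow.keras.models.load_model(path).
    with open(path, "rb") as fh:
        return fh.read()


@lru_cache(maxsize=1)
def get_model(path="models/model.h5"):
    """Load the model once per process; later calls return the same object."""
    return _load_from_disk(path)
```

Because Streamlit re-executes the app script on each interaction, a cache like this would probably need to live in a separately imported module rather than in app.py itself to survive reruns.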