/TH-National-Document-OCR-Part-II

Part II Tesseract OCR Tools for reading Thai National Document using TH Sarabun National Font for training and finetune. Read README.md to see my step to developing.

Primary LanguageJupyter Notebook

TH National Document OCR (THND OCR)

DOI

Tesseract OCR Tools for read Thai National Document using TH Sarabun National Font for training and finetune. Read README.md to see my step to developing.

Part I : https://github.com/copninich/TH-National-Document-OCR-Part-I

Part II: https://github.com/copninich/TH-National-Document-OCR-Part-II

Medium : https://copninich.medium.com/thnd-ocr-the-series-version-%E0%B8%A1%E0%B9%89%E0%B8%A7%E0%B8%99%E0%B9%80%E0%B8%94%E0%B8%B5%E0%B8%A2%E0%B8%A7%E0%B8%88%E0%B8%9A%E0%B8%9A%E0%B8%9A%E0%B8%9A-%E0%B8%AD%E0%B8%B0%E0%B9%80%E0%B8%AB%E0%B9%89%E0%B8%A2%E0%B8%A2-5a673cc8a686

0. Information

0.1 Tools

0.2 Dataset

0.3 Project information

  • Ai Builders
  • Kampanart Chaimooltan

01.Performance testing

Using Character Errorate and leght string (OCR & Correct Text) and output result testing (.csv file)

02.Generate dataset

Using PIL library and using TH Sarabun formart font 72 px to create dataset.

03.Testing train and fine tune tesserct (default langdata_lstm)

Requirements langdata_lstm , tesseract 4 , tessdata_best

Load file to your folder and extract : https://drive.google.com/drive/folders/1ABo7ooO62Tb03RR_VvkdshRVG9vz23sl?usp=sharing

04.Train and fine tune

Run scrript script_basic.ipynb or script_config_error.ipynb Requirements langdata_lstm , tesseract 4 , tessdata_best

Custom tha.training_text with my own dataset more than 60k sentences

Load file to your folder and extract : https://drive.google.com/drive/folders/1ABo7ooO62Tb03RR_VvkdshRVG9vz23sl?usp=sharing

05.Performance testing

report_performace_final.csv