Tesseract OCR tools for reading Thai national documents, trained and fine-tuned on the TH Sarabun national font. See README.md for a description of my process.
- Part I : https://github.com/copninich/TH-National-Document-OCR-Part-I
- Part II : https://github.com/copninich/TH-National-Document-OCR-Part-II
- Tesseract : https://github.com/tesseract-ocr/tesseract
- PyThaiNLP (Prachathai) : https://github.com/PyThaiNLP/prachathai-67k
- PyThaiNLP (ThaiGov V2 Corpus) : https://github.com/PyThaiNLP/thaigov-v2-corpus
- PyThaiNLP (ThaiGov Archive corpus) : https://github.com/PyThaiNLP/thaigov-archive-corpus
- ThaiSum : https://github.com/nakhunchumpolsathien/ThaiSum
- TR-TPBS : https://github.com/nakhunchumpolsathien/TR-TPBS
- Ai Builders 2021
- Kampanart Chaimooltan
I used Character Error Rate (CER) and string length (of the OCR output and of the correct text) for evaluation, and wrote the test results to a .csv file.
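The evaluation described above can be sketched as follows. This is a minimal illustration, not the code from the notebooks: it assumes CER is the Levenshtein edit distance divided by the length of the correct text, and the function names and CSV column names are hypothetical. Note that `len()` counts Unicode code points, so Thai vowel and tone marks each count as one character.

```python
import csv


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, correct_text):
    """Assumed definition: edit distance / length of the correct text."""
    if not correct_text:
        return 0.0 if not ocr_text else 1.0
    return levenshtein(ocr_text, correct_text) / len(correct_text)


def write_report(pairs, stream):
    """Write one CSV row per (ocr_text, correct_text) pair.

    The column names are hypothetical, chosen for this sketch only.
    """
    writer = csv.writer(stream)
    writer.writerow(["ocr_text", "correct_text",
                     "ocr_length", "correct_length", "cer"])
    for ocr_text, correct_text in pairs:
        cer = character_error_rate(ocr_text, correct_text)
        writer.writerow([ocr_text, correct_text,
                         len(ocr_text), len(correct_text), round(cer, 4)])
```

To produce a file, open it with `open(path, "w", newline="", encoding="utf-8")` and pass the handle as `stream`.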
I used the PIL library. In addition, I used the TH Sarabun font at 72 px to create the datasets.
Link : https://www.kaggle.com/copninich/thaienglish-character-in-th-sarabun-font
Requirements
- langdata_lstm
- tesseract v.4
- tessdata_best
Download the files to your folder and extract them : https://drive.google.com/drive/folders/1ABo7ooO62Tb03RR_VvkdshRVG9vz23sl?usp=sharing
Run script_basic.ipynb or script_config_error.ipynb
Requirements
- langdata_lstm
- tesseract v.4
- tessdata_best
Custom tha.training_text built from my own datasets (more than 1.9 M sentences)
report_performace_final.csv