Tesseract OCR tools for reading Thai national documents, using the TH Sarabun national font for training and fine-tuning. Read README.md to see the steps I followed during development.
Part I : https://github.com/copninich/TH-National-Document-OCR-Part-I
Part II: https://github.com/copninich/TH-National-Document-OCR-Part-II
- Tesseract : https://github.com/tesseract-ocr/tesseract
- PyThaiNLP (Prachathai) : https://github.com/PyThaiNLP/prachathai-67k
- PyThaiNLP (ThaiGov V2 Corpus) : https://github.com/PyThaiNLP/thaigov-v2-corpus
- PyThaiNLP (ThaiGov Archive corpus) : https://github.com/PyThaiNLP/thaigov-archive-corpus
- ThaiSum : https://github.com/nakhunchumpolsathien/ThaiSum
- TR-TPBS : https://github.com/nakhunchumpolsathien/TR-TPBS
- Ai Builders
- Kampanart Chaimooltan
Evaluation uses the character error rate (CER) and the string lengths of the OCR output and the correct text; test results are written to a .csv file.
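As a rough illustration of this evaluation, the sketch below computes CER as the Levenshtein edit distance divided by the length of the correct text, and writes one row per sample to a CSV file. The function names and the CSV column names are assumptions made for this example; they are not taken from the notebooks in this repository.

```python
import csv


def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def cer(ocr_text, correct_text):
    """Character error rate: edit distance / length of the correct text."""
    if not correct_text:
        return 0.0 if not ocr_text else 1.0
    return edit_distance(ocr_text, correct_text) / len(correct_text)


def write_report(pairs, csv_path):
    """Write (ocr_text, correct_text) pairs with lengths and CER to a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Column names are hypothetical, chosen for this sketch.
        writer.writerow(["ocr_text", "correct_text",
                         "ocr_length", "correct_length", "cer"])
        for ocr, ref in pairs:
            # len() counts Unicode code points, so Thai vowel and tone marks
            # each count as one character.
            writer.writerow([ocr, ref, len(ocr), len(ref), f"{cer(ocr, ref):.4f}"])
```

Note that CER can exceed 1.0 when the OCR output is much longer than the correct text.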
The dataset is created with the PIL library by rendering text in the TH Sarabun font at 72 px.
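A minimal sketch of how one line of text could be rendered with PIL (Pillow) is shown here. The function name `render_line`, the `font_path` argument, and the padding value are assumptions for illustration; the actual dataset script may differ. Correct placement of Thai vowel and tone marks can also depend on how Pillow was built, which this sketch does not handle.

```python
from PIL import Image, ImageDraw, ImageFont, ImageOps


def render_line(text, font_path=None, font_size=72, padding=10):
    """Render one line of text as a black-on-white grayscale image."""
    # font_path is a hypothetical location of a TH Sarabun .ttf file.
    if font_path is not None:
        font = ImageFont.truetype(font_path, font_size)
    else:
        # Fallback for a quick smoke test only; it may lack Thai glyphs.
        font = ImageFont.load_default()

    # Draw white text on an oversized black canvas, then crop to the ink.
    width = max(1, len(text)) * font_size + 2 * padding
    canvas = Image.new("L", (width, font_size * 3), 0)
    ImageDraw.Draw(canvas).text((padding, font_size), text, font=font, fill=255)

    box = canvas.getbbox()  # None when nothing visible was drawn
    if box is None:
        raise ValueError("text produced no visible pixels")

    # Invert to black text on white and add a white margin.
    line = ImageOps.invert(canvas.crop(box))
    return ImageOps.expand(line, border=padding, fill=255)
```

An image returned by this function could then be saved with `img.save("sample.png")` next to a text file holding the same sentence as ground truth.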
Requirements: langdata_lstm, Tesseract 4, tessdata_best
Download the files to your folder and extract them: https://drive.google.com/drive/folders/1ABo7ooO62Tb03RR_VvkdshRVG9vz23sl?usp=sharing
Run script_basic.ipynb or script_config_error.ipynb.
Requirements: langdata_lstm, Tesseract 4, tessdata_best
A custom tha.training_text built from my own dataset of more than 60k sentences.
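One possible way to assemble such a training text from a list of sentences is sketched below: one sentence per line, UTF-8 encoded, with blank lines and exact duplicates dropped. The function name and the whitespace normalization are assumptions for this example, not a description of how the file in this repository was produced.

```python
def build_training_text(sentences, out_path):
    """Write unique, non-empty sentences to out_path, one per line (UTF-8)."""
    seen = set()
    kept = []
    for sentence in sentences:
        # Collapse runs of whitespace; this normalization is an assumption.
        line = " ".join(sentence.split())
        if line and line not in seen:
            seen.add(line)
            kept.append(line)
    with open(out_path, "w", encoding="utf-8", newline="\n") as f:
        f.writelines(line + "\n" for line in kept)
    return len(kept)  # number of sentences actually written
```

For example, `build_training_text(my_sentences, "tha.training_text")` would write the file into the current directory, where `my_sentences` is a hypothetical list of strings collected from the corpora.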
Download the files to your folder and extract them: https://drive.google.com/drive/folders/1ABo7ooO62Tb03RR_VvkdshRVG9vz23sl?usp=sharing
report_performace_final.csv