

UnarchivingBengali

The objective of this research is to fine-tune the .traineddata model of Tesseract, Google's open-source OCR engine, so that out-of-print or archived Bengali literature can be converted to plain text.

Releases

Please check the Releases tab for the latest status updates. The approaches tried so far have not produced any significant performance improvement, so further experimentation is needed.

Training data

Training data is obtained by converting and cleaning line-level images from this archived text.
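
As an illustration of what the cleaning step could look like, here is a minimal Python sketch. It is not the code used in this repository. It assumes each line image sits next to a transcription file with the same base name and a .gt.txt suffix (the layout used by the tesstrain tooling), and the helper names clean_line and find_pairs are hypothetical. Zero-width joiners are deliberately left in place, since they can be meaningful in Bengali text.

```python
import unicodedata
from pathlib import Path


def clean_line(text: str) -> str:
    """Normalize one transcription line (hypothetical helper).

    Applies Unicode NFC so that visually identical strings share one
    encoding, then collapses runs of whitespace into single spaces.
    """
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def find_pairs(data_dir, image_suffix=".png"):
    """Return (image, ground_truth) path pairs found in data_dir.

    Assumes the ground truth for 'line_001.png' is 'line_001.gt.txt'.
    Images without a matching transcription are skipped.
    """
    pairs = []
    for image in sorted(Path(data_dir).glob(f"*{image_suffix}")):
        gt = image.with_name(image.stem + ".gt.txt")
        if gt.exists():
            pairs.append((image, gt))
    return pairs


def clean_ground_truth(data_dir, image_suffix=".png"):
    """Rewrite every paired .gt.txt file in place with its cleaned text."""
    pairs = find_pairs(data_dir, image_suffix)
    for _, gt in pairs:
        raw = gt.read_text(encoding="utf-8")
        gt.write_text(clean_line(raw) + "\n", encoding="utf-8")
    return len(pairs)
```

Calling clean_ground_truth("data/ground-truth") would rewrite the transcriptions in that (assumed) directory and return the number of image/text pairs it found. Images without a transcription are ignored rather than reported, which may or may not be the right choice for a real dataset.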

Reading list