UnarchivingBengali

The objective of this research is to fine-tune the .traineddata model of Tesseract, Google's open-source OCR engine, so that out-of-print or archived Bengali literature can be converted to plain text.
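As a rough illustration of the intended end use, the sketch below builds a Tesseract command line that recognizes one line image with a Bengali model. The helper name `build_tesseract_command`, the file paths, and the choice of page segmentation mode 7 (single text line) are assumptions made for this example and are not taken from the repository; `ben` is the language code of the stock Bengali model, and a fine-tuned model would be selected by its own file name.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="ben",
                            tessdata_dir=None, psm=7):
    """Build the argument list for one Tesseract run (hypothetical helper).

    lang is the name of the .traineddata file without its extension.
    psm=7 asks Tesseract to treat the image as a single text line.
    """
    cmd = ["tesseract", str(image_path), str(output_base),
           "-l", lang, "--psm", str(psm)]
    if tessdata_dir is not None:
        # Directory holding the .traineddata file, e.g. a fine-tuned model.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def recognize(image_path, output_base, **options):
    """Run Tesseract; the text is written to '<output_base>.txt'."""
    subprocess.run(build_tesseract_command(image_path, output_base, **options),
                   check=True)
```

Calling `recognize("line.png", "line", tessdata_dir="models")` would require a local Tesseract installation and a `ben.traineddata` file in the illustrative `models` directory.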

Releases

Check the Releases tab for the latest status updates. The approaches tried so far show no significant improvement in performance, so further experimentation is needed.

Training data

Training data is obtained by converting and cleaning line-level images from this archived text.
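Tesstrain expects each line image to sit next to a transcription file with the same base name and a `.gt.txt` suffix. The sketch below is one way to check a ground-truth directory for unpaired or empty files before training. The function name `find_unpaired` and the restriction to plain `.tif` and `.png` images are assumptions made for this example; it is not part of the repository's tooling.

```python
from pathlib import Path

# Assumption: only plain .tif and .png line images are used.
IMAGE_SUFFIXES = {".tif", ".png"}
GT_SUFFIX = ".gt.txt"


def find_unpaired(ground_truth_dir):
    """Return (images without text, texts without image, empty texts).

    Each element is a sorted list of base names.
    """
    images = {}
    transcripts = {}
    for path in Path(ground_truth_dir).iterdir():
        if path.name.endswith(GT_SUFFIX):
            transcripts[path.name[:-len(GT_SUFFIX)]] = path
        elif path.suffix.lower() in IMAGE_SUFFIXES:
            images[path.stem] = path

    missing_text = sorted(images.keys() - transcripts.keys())
    missing_image = sorted(transcripts.keys() - images.keys())
    # A transcription containing only whitespace cannot supervise training.
    empty_text = sorted(
        name for name, path in transcripts.items()
        if not path.read_text(encoding="utf-8").strip()
    )
    return missing_text, missing_image, empty_text
```

Running the check after each cleaning pass makes it easier to spot lines that lost their transcription or image along the way.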

Reading list

Tesstrain README
Tesseract Documentation
VGSL Specs
Neural Networks in Tesseract OCR
Training Tesseract
