Bangla OCR

Dataset Description:

BanglaWriting

Process

Preprocessing

The raw dataset is not preprocessed. Word images are extracted from the raw image folder using the provided JSON file, and each cropped image is binarized with Otsu's method during extraction. Each extracted file is named according to the pattern below.

"পরিবার 18__225_15_1.jpg" as "label wordNumberOfThePage__uniquePersonNumber_age_gender.extension"

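The two conventions above can be illustrated with a minimal sketch. This is not the repository's generator.py. It computes an Otsu threshold in pure Python over a flat list of 8-bit grayscale values (with OpenCV, cv2.threshold with the cv2.THRESH_OTSU flag does the same job on an image array) and splits a filename into its fields. The function names, the Sample field names, and the regular expression are assumptions made for this example. The meaning of the gender code is not stated here, so it is kept as a raw integer.

```python
import os
import re
from typing import NamedTuple


def otsu_threshold(pixels):
    """Return the threshold (0..255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(pixels):
    """Map values above the Otsu threshold to 255 and the rest to 0."""
    t = otsu_threshold(pixels)
    return [255 if p > t else 0 for p in pixels]


class Sample(NamedTuple):
    label: str
    word_number: int
    person_id: int
    age: int
    gender: int  # raw code; its meaning is not documented here


# Assumed pattern: "<label> <word>__<person>_<age>_<gender>"
_NAME_RE = re.compile(r"(.+) (\d+)__(\d+)_(\d+)_(\d+)")


def parse_filename(name):
    """Split a word-image filename into its label and metadata fields."""
    stem, _ext = os.path.splitext(os.path.basename(name))
    match = _NAME_RE.fullmatch(stem)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    label, word, person, age, gender = match.groups()
    return Sample(label, int(word), int(person), int(age), int(gender))
```

For example, parse_filename("পরিবার 18__225_15_1.jpg") yields the label "পরিবার" with word number 18, person number 225, age 15, and gender code 1.
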
Model

CRNN = CNN + Bidirectional GRU

Loss Function

CTC Loss
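
A model trained with CTC loss emits one class index per time step, including a special blank class, so its raw output has to be collapsed into a label sequence. The sketch below shows greedy (best-path) decoding under the assumption that the blank index is 0, which is the default of torch.nn.CTCLoss. The function name is hypothetical, and the notebook may decode differently.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse per-frame class indices into a label sequence.

    Consecutive repeats are merged first, then blanks are dropped, so a
    blank between two identical indices keeps both of them.
    """
    decoded = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded
```

Here frame_ids would be the per-time-step argmax of the network output for a single image; the resulting indices are then mapped back to characters with the vocabulary used in training.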

Optimizer

Adam

Usage

Download the dataset from the provided link, unzip the "raw" file into the current directory, and run:

python generator.py

Finally, run the notebook.

Requirements

python==3.7.0
numpy==1.16.0
scikit-learn==0.23.2
opencv-python==4.4.0.46
torch==1.7.0
tqdm==4.53.0

Further improvements could come from:

Preprocessing such as skew correction, noise removal, thinning and skeletonization
Gathering and/or generating synthetic data
Making the dataset balanced
Using Focal CTC loss to overcome the class imbalance problem
Using edit distance to predict the nearest word
Using a better optimizer such as RAdam
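
The edit-distance idea in the list above could look like the following minimal sketch: compute the Levenshtein distance between the predicted string and every word in a vocabulary, and return the closest one. The function names and the vocabulary are assumptions made for illustration. The distance is computed over Unicode code points, so a Bangla character written with a base letter plus a vowel sign counts as two symbols.

```python
def levenshtein(a, b):
    """Return the minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def nearest_word(prediction, vocabulary):
    """Return the vocabulary word closest to the prediction.

    Ties are broken by order of appearance in the vocabulary.
    """
    return min(vocabulary, key=lambda word: levenshtein(prediction, word))
```

A linear scan like this is fine for a small vocabulary; a larger one would likely need an index such as a BK-tree or a length-based prefilter.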

sazzadhrz/Bangla-OCR