The goal of this experiment is simple: solve a simplified version of the line text recognition problem by building a character recognizer.
The dataset we will use for this task is EMNIST, which, thanks to Cohen et al., is fully labelled.
Here we experimented with 3 different architectures: LeNet, ResNet, and a custom CNN.
Results
- Lenet
- Resnet
- Custom
- Evaluation on Test dataset
Breakdown of classification results on the test dataset for the 3 architectures above.
Learnings
- Initially we trained all models with a constant learning rate.
- Replacing the constant learning rate with a cyclic learning rate schedule and a learning rate finder gave a great boost in both training speed and accuracy across our experiments.
- Transfer learning with ResNet-18 performed poorly.
- From the test evaluation above, we can see that the model performs poorly on visually similar characters: digit 1 vs. letter l, digit 0 vs. letter o/O, digit 5 vs. letter s/S, and digit 9 vs. letter q/Q.
- Accuracies on the train dataset: 78% for LeNet, 83% for ResNet, and 84% for the custom model.
- Accuracies on the val dataset: 80% for LeNet, 81% for ResNet, and 82% for the custom model.
- Accuracies on the test dataset: 62% for LeNet, 36% for ResNet, and 66% for the custom model.
- The custom architecture performs well, but ResNet performs poorly (why?).
- There is a large gap between train/val and test accuracy, even though the val distribution matches the test distribution (the val set is 10% of the test set).
- Look for new ways to increase accuracy
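The cyclic learning rate mentioned above can be sketched as a simple triangular schedule (after Smith's cyclical learning rates); the `base_lr`, `max_lr`, and `step_size` values below are illustrative, not the ones we used:

```python
def triangular_lr(step, base_lr=1e-4, max_lr=1e-2, step_size=500):
    """Triangular cyclic learning rate: ramp from base_lr up to max_lr over
    step_size steps, then back down, and repeat."""
    cycle = step // (2 * step_size)
    x = abs(step / step_size - 2 * cycle - 1)  # position in cycle; 1 -> at base_lr
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```

A learning rate finder works the same way in reverse: sweep the learning rate upward over a few hundred batches, plot loss vs. rate, and pick `max_lr` just before the loss diverges.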
Next, we will build a Line Text Recognizer. Given an image of a line of words, the task is to output the characters present in the line.
We will use a CNN as a sliding-window feature extractor feeding an LSTM, trained with the CTC loss function.
For training we will use a synthetic dataset built by composing sentences from EMNIST characters, as well as the IAM dataset.
We first constructed the EMNIST Lines dataset. To build it we used characters from the EMNIST dataset and text from the Brown corpus in NLTK. We fixed the number of characters in each line at 34, so each image in the dataset has shape (28, 28*34). The image below shows some samples from the EMNIST Lines dataset.
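A minimal sketch of how such a line can be composed, assuming a hypothetical `glyphs` dict mapping each character to a list of 28x28 EMNIST crops:

```python
import random
import numpy as np

def make_line_image(glyphs, text, line_len=34):
    """Compose a (28, 28*line_len) line image from per-character glyphs.
    `glyphs` maps each character to a list of 28x28 uint8 arrays (assumed
    layout); spaces and characters without glyphs become blank cells."""
    blank = np.zeros((28, 28), dtype=np.uint8)
    text = text[:line_len].ljust(line_len)  # truncate/pad to the fixed length
    cells = [random.choice(glyphs[c]) if c in glyphs else blank for c in text]
    return np.hstack(cells)
```

Picking a random glyph per character gives some visual variety across lines built from the same sentence.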
We started with the simplest model: a pure CNN that predicts the characters in the line. We tried the same 3 architectures as above (LeNet, ResNet, and custom) and achieved character accuracies of only 1%, 0.017%, and 3.6% respectively.
- Lenet CNN
- Resnet CNN
- Custom CNN
Next, we built a more complex model: a CNN-LSTM model trained with CTC loss, using the same 3 CNN architectures (LeNet, ResNet, custom) as backbones. The results were remarkable: character accuracy of 95% with LeNet and 96% with the custom architecture.
- Lenet and Custom LSTM-CTC Model
- Lenet LSTM-CTC Model
- Custom LSTM-CTC Model
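A rough PyTorch sketch of the CNN-LSTM-CTC layout, with an assumed small CNN backbone (not our exact LeNet/custom networks):

```python
import torch
import torch.nn as nn

class CNNLSTMCTC(nn.Module):
    """CNN backbone + bidirectional LSTM + per-step class scores for CTC.
    The tiny CNN here is a stand-in for the real backbones."""
    def __init__(self, num_classes, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(64 * 7, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)  # num_classes includes the CTC blank

    def forward(self, x):                      # x: (N, 1, 28, 28*34)
        f = self.cnn(x)                        # (N, 64, 7, W) with W = 28*34 / 4
        f = f.permute(0, 3, 1, 2).flatten(2)   # (N, W, 64*7): one feature per time step
        out, _ = self.lstm(f)                  # (N, W, 2*hidden)
        # nn.CTCLoss expects log-probabilities (and time-major (T, N, C) input)
        return self.fc(out).log_softmax(-1)    # (N, W, num_classes)
```

Training would pair this with `nn.CTCLoss`, transposing the output to time-major (T, N, C) form first.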
Now we tried the same model with only the dataset changed: EMNIST Lines replaced by the IAM Lines dataset.
And the results.
- Lenet and Custom LSTM-CTC Model
- Lenet LSTM-CTC Model
- Custom LSTM-CTC Model
Learnings
- Switching datasets worked, but the model still needs a lot more training time to produce finer predictions, i.e., train more.
- The LSTM stage invites many experiments: bidirectional or not, GRU vs. LSTM cells. Trying different combinations might yield even better results for each CNN backbone.
- Further, we can make use of attention-based models and language models, which will make the model more robust.
- Use beam search decoding for CTC models.
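For reference, CTC prefix beam search can be sketched in plain Python; here `log_probs` is a T x C grid of per-step log-probabilities and class 0 is the CTC blank (both conventions assumed):

```python
import math
from collections import defaultdict

NEG_INF = float('-inf')

def logsumexp(*xs):
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_beam_search(log_probs, beam_width=3, blank=0):
    """Tiny CTC prefix beam search: each beam entry maps a label prefix to
    (log p of paths ending in blank, log p of paths ending in its last label)."""
    beams = {(): (0.0, NEG_INF)}
    for step in log_probs:
        new = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (pb, pnb) in beams.items():
            for c, lp in enumerate(step):
                if c == blank:
                    nb, nnb = new[prefix]
                    new[prefix] = (logsumexp(nb, pb + lp, pnb + lp), nnb)
                elif prefix and prefix[-1] == c:
                    # repeated label: extend only from blank-ended paths,
                    # otherwise the repeat collapses onto the same prefix
                    ext = prefix + (c,)
                    nb, nnb = new[ext]
                    new[ext] = (nb, logsumexp(nnb, pb + lp))
                    nb2, nnb2 = new[prefix]
                    new[prefix] = (nb2, logsumexp(nnb2, pnb + lp))
                else:
                    ext = prefix + (c,)
                    nb, nnb = new[ext]
                    new[ext] = (nb, logsumexp(nnb, pb + lp, pnb + lp))
        beams = dict(sorted(new.items(),
                            key=lambda kv: logsumexp(*kv[1]),
                            reverse=True)[:beam_width])
    return list(max(beams.items(), key=lambda kv: logsumexp(*kv[1]))[0])
```

Greedy (best-path) decoding picks the argmax at each step; beam search instead sums probability mass over all alignments of each prefix, which usually recovers a few extra points of character accuracy.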
Almost done! We have completed the Line Text Predictor. Now comes the part of implementing the Line Detector. For this we will use the IAM dataset again, but its paragraph subset. Here is a sample image from the paragraph dataset.
The objective of this experiment is to design a line detector: given a paragraph image, the model must detect each line. What does "detect" mean here? We preprocess the paragraph dataset so that each pixel is assigned one of 3 classes: 0 if it belongs to the background, 1 if it belongs to an odd-numbered line, and 2 if it belongs to an even-numbered line. Wait, why 3 classes when 2 seem sufficient? The images below explain why we need 3 classes instead of 2.
With 2 classes: 0 for background and 1 for pixels on a line.
With 3 classes: 0 for background, 1 for odd-numbered lines and 2 for even-numbered lines.
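The preprocessing can be sketched as follows, assuming we have per-line bounding boxes in (y0, y1, x0, x1) form:

```python
import numpy as np

def line_mask_to_classes(line_boxes, height, width):
    """Rasterize per-line boxes (y0, y1, x0, x1), ordered top to bottom, into a
    3-class mask: 0 = background, 1 = odd lines, 2 = even lines. Alternating
    the two line classes keeps adjacent lines separable; with a single line
    class, touching lines would merge into one connected region."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for i, (y0, y1, x0, x1) in enumerate(line_boxes):
        mask[y0:y1, x0:x1] = 1 if i % 2 == 0 else 2
    return mask
```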
Here is how our dataset for line detection looks after preprocessing.
Here is a sample after applying data augmentation.
Now that we have the dataset, paragraph images of size (256, 256) with ground truths of size (256, 256, 3), we use fully convolutional networks to produce an output of size (256, 256, 3) from an input of size (256, 256). We use 3 architectures: LeNet-FCN (the classifier converted to a fully convolutional network), ResNet-FCN, and custom-FCN.
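A minimal fully convolutional sketch in PyTorch (the layer sizes are assumptions, not our actual LeNet/ResNet/custom conversions):

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Downsample with strided convs, then upsample with transposed convs, so a
    (N, 1, 256, 256) paragraph maps to (N, 3, 256, 256) per-pixel class scores."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 256 -> 128
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 128 -> 64
        )
        self.up = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),    # 64 -> 128
            nn.ConvTranspose2d(16, num_classes, 2, stride=2),      # 128 -> 256
        )

    def forward(self, x):
        return self.up(self.down(x))
```

Because there are no fully connected layers, the same network can run on paragraph images of other sizes, with the output resolution following the input.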
The results are a bit embarrassing.
- Lenet-FCN
- Resnet-FCN
- Custom-FCN
Learnings
- Investigate why the model is not segmenting well. Having a good line segmenter is critical for our OCR pipeline.
Finally, all the pieces from the experiments above come together. To recap: the Line Predictor Model from experiment-2 takes line images as input and predicts the characters in each line, and the Line Detector Model from experiment-3 segments paragraphs into line regions.
Do you see the whole picture coming together? No?
- Given an image like the one above, we want a model that returns all the text in the image.
- First, we run the Line Detector Model, which segments the image into lines.
- We then extract crops of the image corresponding to the detected line regions and pass them to the Line Predictor Model, which predicts the characters present in each line region.
- Sure enough, if both models are well trained, we will get excellent results!
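The glue step between the two models, cropping detected line bands out of the paragraph image, might look like this (assumes horizontal lines and the 3-class mask convention from the line detection experiment):

```python
import numpy as np

def extract_line_crops(image, mask):
    """Cut the image into per-line crops using a 3-class mask (0 = background,
    1/2 = alternating lines). A new crop starts whenever the dominant line
    class of a row changes, which is exactly why the classes alternate."""
    # dominant line class per row (0 where the row has no line pixels)
    row_class = np.array([np.bincount(r, minlength=3)[1:].argmax() + 1
                          if (r > 0).any() else 0 for r in mask])
    crops, start = [], None
    for y in range(len(row_class) + 1):
        cur = row_class[y] if y < len(row_class) else 0
        prev = row_class[y - 1] if y > 0 else 0
        if cur != prev:
            if prev != 0 and start is not None:
                crops.append(image[start:y])  # close the previous line band
            start = y if cur != 0 else None
    return crops
```

Each returned crop would then be resized to the Line Predictor's input shape and decoded into text.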
Now that we have a full end-to-end model, we can serve it from a web server or build an Android app around it.