Optical Character Recognition (OCR) with PyTorch

This repository contains a PyTorch implementation of an Optical Character Recognition (OCR) system utilizing a convolutional neural network (CNN) feature extractor followed by a bidirectional LSTM (BiLSTM) for sequence modeling.

Feature Extractor

The feature extractor architecture consists of several convolutional layers followed by batch normalization, ReLU activation, and max-pooling operations. Here is the architecture of the feature extractor:

self.feature_extractor = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),
    nn.ReLU(True),
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(True),
    nn.MaxPool2d(2), 

    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
    nn.ReLU(True),
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(True),
    nn.MaxPool2d(2), 

    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
    nn.ReLU(True),
    nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(256),
    nn.ReLU(True),
    nn.MaxPool2d(2),

    nn.Conv2d(256, 512, kernel_size=7, stride=1, padding=0),
    nn.ReLU(True),
    nn.MaxPool2d((2, 1), (2, 1)),

    nn.Flatten(2)
)

Sequence Modeling

The sequence modeling component utilizes a bidirectional LSTM (BiLSTM) to capture sequential information from the features extracted by the CNN. Here is the architecture of the BiLSTM:

class BidirectionalLSTM(nn.Module):

    def __init__(self, input_size, hidden_size, output_size):
        super(BidirectionalLSTM, self).__init__()
        self.rnn = nn.LSTM(input_size, hidden_size, bidirectional=True, batch_first=True)
        self.linear = nn.Linear(hidden_size * 2, output_size)

    def forward(self, input):
        self.rnn.flatten_parameters()
        recurrent, _ = self.rnn(input)  # batch_size x T x input_size -> batch_size x T x (2*hidden_size)
        output = self.linear(recurrent)  # batch_size x T x output_size
        return output

OCR Model

The OCR model combines the feature extractor and the sequence modeling components. It consists of the following architecture:

class OCR_Model(nn.Module):
    def __init__(self, num_classes):
        super(OCR_Model, self).__init__()
        self.feature_extractor = feature_extractor()
        self.SequenceModeling = nn.Sequential(
            BidirectionalLSTM(512, 512, 512),
            BidirectionalLSTM(512, 512, 512)
        )
        self.linear = nn.Linear(512, num_classes+1)

    def forward(self,x):
        features =  self.feature_extractor(x)
        lstm_out =  self.SequenceModeling(features)
        return self.linear(lstm_out)

Usage

To train the OCR model, you can follow these steps:

  • Install Dependencies $ pip install -r requirements.txt
  • Prepare your dataset and ensure it is compatible with the model input format.
  • Define the model configuration and instantiate the OCR model.
  • Train the model using your dataset and monitor the loss and accuracy metrics.

Loss Curve

Example OCR

ROC Curve

Example OCR

License

This project is licensed under the MIT License - see the LICENSE file for details.