Scene Text Recognition

Scene text recognition (optical character recognition) with deep learning methods in Farsi.

Quick Links

Dependencies

  • Install dependencies: $ pip install -r requirements.txt
  • Download Pretrained Weights Here

Getting Started

Fig. 1: Model architecture.

  • Project Structure
.
├── src
│   ├── nn
│   │   ├── feature_extractor.py
│   │   ├── layers.py
│   │   └── ocr_model.py
│   └── utils
│       ├── dataset.py
│       ├── labelConverter.py
│       ├── loss_calculator.py
│       ├── misc.py
│       ├── trainUtils.py
│       └── transforms.py
├── config.py
└── train.py
  • Place the dataset paths in the config.py file.
ds_path = {
    "train_ds" : "path/to/train/dataset",
    "test_ds" : "path/to/test/dataset",
}
  • Dataset structure (each image should contain a single word)
.
├── Images
│   ├── img_1.jpg
│   ├── img_2.jpg
│   ├── img_3.jpg
│   ├── img_4.jpg
│   └── img_5.jpg
│   ...
└── labels.json
  • labels.json Contents
{"img_1": "بالا", "img_2": "و", "img_3": "بدانند", "img_4": "چندین", "img_5": "به", ...}
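The layout above can be read with a few lines of standard-library Python. The following is a minimal sketch, not the loader in src/utils/dataset.py: the function name `load_samples` and the `image_ext` parameter are hypothetical, and it assumes that each key in labels.json is the image file name without its extension, as in the example above.

```python
import json
from pathlib import Path


def load_samples(ds_root, image_ext=".jpg"):
    """Return a list of (image_path, word) pairs for one dataset directory.

    Assumes the layout shown above: <ds_root>/Images/*.jpg and
    <ds_root>/labels.json, with label keys equal to the image file stem.
    """
    ds_root = Path(ds_root)
    # Labels are Farsi words, so read the file explicitly as UTF-8.
    with open(ds_root / "labels.json", encoding="utf-8") as f:
        labels = json.load(f)

    samples = []
    for stem, word in labels.items():
        image_path = ds_root / "Images" / f"{stem}{image_ext}"
        if image_path.is_file():  # skip labels whose image is missing
            samples.append((image_path, word))
    return samples
```

With the configuration shown earlier, a call such as `load_samples(ds_path["train_ds"])` would return the training pairs, under the same assumptions.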

Overview

Training

Objective Function

Denote the training dataset by $TD = \langle X_i, Y_i \rangle$, where $X_i$ is a training image and $Y_i$ is its word label. Training is conducted by minimizing the objective function, which is the negative log-likelihood of the conditional probability of the word label:

$$O = -\sum_{(X_i, Y_i) \in TD} \log P(Y_i|X_i)$$

This function calculates a cost from an image and its word label, and the modules in the framework are trained in an end-to-end manner.
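As a small numeric illustration of the objective, the sketch below sums the negative log-probabilities of the correct labels. The probability values are made up for the example, and `negative_log_likelihood` is a hypothetical helper, not a function from this repository.

```python
import math


def negative_log_likelihood(label_probs):
    """O = -sum_i log P(Y_i | X_i), given one probability per training sample."""
    return -sum(math.log(p) for p in label_probs)


# Hypothetical values of P(Y_i | X_i) for three training samples.
probs = [0.9, 0.8, 0.95]
objective = negative_log_likelihood(probs)
```

The objective is zero only when every label has probability 1, and it grows as the model assigns lower probability to the correct words.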

Fig. 2: Model training history.

CTC Loss

CTC takes a sequence $H = h_1, \ldots, h_T$, where $T$ is the sequence length, and outputs the probability of a path $\pi$, which is defined as

$$P(\pi|H) = \prod_{t = 1}^T y_{{\pi}_t}^t$$

where $y_{\pi_t}^t$ is the probability of generating character $\pi_t$ at time step $t$.
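The path probability can be sketched in plain Python as a product over time steps. The per-step distributions, the two-letter alphabet, and the blank symbol `-` below are made up for illustration. The `collapse` helper shows the standard CTC mapping from a path to a label (merge repeated symbols, then drop blanks), which this README does not spell out, so treat it as an assumption about how paths relate to words here.

```python
import math

BLANK = "-"  # hypothetical blank symbol for this example


def path_probability(y, path):
    """P(pi | H) = prod_t y[t][pi_t], where y[t] maps symbol -> probability."""
    return math.prod(y[t][symbol] for t, symbol in enumerate(path))


def collapse(path, blank=BLANK):
    """Standard CTC mapping: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return "".join(out)


# Hypothetical per-time-step distributions over {"a", "b", blank}, T = 3.
y = [
    {"a": 0.6, "b": 0.1, BLANK: 0.3},
    {"a": 0.5, "b": 0.2, BLANK: 0.3},
    {"a": 0.1, "b": 0.7, BLANK: 0.2},
]
p = path_probability(y, ["a", "a", "b"])  # 0.6 * 0.5 * 0.7
word = collapse(["a", "a", "b"])
```

In the standard CTC formulation, the probability of a word is the sum of `path_probability` over all paths that collapse to that word.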

| Model | Input Size | Recall | Precision | F1 | Params | Speed (img/s) |
|---|---|---|---|---|---|---|
| OCR-Base | $1 \times 64 \times 192$ | 0.993 | 0.997 | 0.997 | 35,023,143 | 89.24 |

Samples

References

🛡️ License

This project is distributed under the MIT License.