Update 2019-05-21: I have rewritten this repo based on fairseq; refer to my implementation here. The fairseq text recognition project is still under construction. Suggestions and cooperation on the fairseq text recognition project are very welcome.
This software implements the Convolutional Recurrent Neural Network (CRNN) in PyTorch, as described in the paper:
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition, Baoguang Shi, Xiang Bai, Cong Yao, PAMI 2017 [arXiv]
This code implements the following architectures (selected via `args.arch`):
- DenseNet + CTC loss (`densenet_cifar`, `densenet121` with pre-trained model)
- ResNet + CTC loss (`resnet_cifar`)
- MobileNetV2 + CTC loss (`mobilenetv2_cifar` with pre-trained model)
- ShuffleNetV2 + CTC loss (`shufflenetv2_cifar`)
Remark: the current network architecture only implements a CNN backbone + fully connected (FC) layers + CTC loss. The CNN acts as a subsampling and encoding layer, the FC layer acts as a decoder layer, and the CTC loss aligns the network's per-frame predictions with the label sequence. For more detail, refer to issue #4 and issue #6.
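For illustration, here is a minimal sketch of that structure. This is not the repo's actual module: the tiny backbone, layer sizes, and shapes are assumptions chosen only to show how CNN + FC + CTC loss fit together.

```python
import torch
import torch.nn as nn

class TinyCRNNSketch(nn.Module):
    """Illustrative CNN backbone + FC decoder, trained with CTC loss."""
    def __init__(self, num_classes):
        super().__init__()
        # CNN backbone: subsamples the 32-pixel-high image and encodes features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height; width becomes time steps
        )
        # FC decoder: maps each time step's features to class scores.
        self.fc = nn.Linear(128, num_classes)

    def forward(self, images):
        feats = self.backbone(images)               # (N, C, 1, W')
        feats = feats.squeeze(2).permute(2, 0, 1)   # (T=W', N, C)
        return self.fc(feats).log_softmax(2)        # (T, N, num_classes) for CTC

# CTC loss aligns the per-frame predictions with the label sequence;
# class 0 is the blank token, matching this repo's encoding.
model = TinyCRNNSketch(num_classes=11)              # e.g. 10 digits + blank
images = torch.randn(4, 3, 32, 280)
log_probs = model(images)                           # (T, N, 11)
targets = torch.randint(1, 11, (4, 14))             # encoded labels; 0 reserved for blank
input_lengths = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((4,), 14, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```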
In order to run this toolbox you will need:
- Python3 (tested with Python 3.6+)
- PyTorch deep learning framework (tested with version 1.0.1)
The demo reads an example image and recognizes its text content. See the demo notebook for all the details.
Example image:
Expected output:
-停--下--来--,--看--着--那--些--握--着------ => 停下来,看着那些握着
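The left side is the raw per-frame prediction; CTC decoding collapses it by merging consecutive repeated symbols and then dropping blanks (`-`). A minimal sketch of that greedy collapse:

```python
def ctc_greedy_collapse(frames, blank="-"):
    """Merge consecutive duplicates, then remove blank symbols."""
    out, prev = [], None
    for sym in frames:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

raw = list("-停--下--来--,--看--着--那--些--握--着------")
print(ctc_greedy_collapse(raw))  # 停下来,看着那些握着
```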
- Navigate (`cd`) to the root of the toolbox `[YOUR_CRNN_ROOT]`.
- Resize the height of each image to 32 pixels, keeping the aspect ratio (see the sketch after this list). Following YCG09's SynthText, the image size is 32x280; the original images can be downloaded from BaiduYun (pw: lu7m). Untar them into the directory `[DATASET_ROOT_DIR]/images`.
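A minimal sketch of the resize step using Pillow (the file names here are placeholders, not part of the toolbox):

```python
from PIL import Image

def resize_to_height(path, target_h=32):
    """Resize so the height is 32 px while preserving the aspect ratio."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    new_w = max(1, round(w * target_h / h))
    return img.resize((new_w, target_h), Image.BILINEAR)

resize_to_height("example.jpg").save("example_32.jpg")
```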
Each line in the annotation file has the format:
img_path encode1 encode2 encode3 encode4 encode5 ...
where each `encode` is the integer token code of one character in the sequence.
For example, suppose the task is recognizing digits in an image, so the alphabet is "0123456789". For an image named "00320_00091.jpg" in the folder `[DATA]/images` whose content is "99353361056742", after conversion there should be the following line in `[DATA]/train.txt` or `[DATA]/dev.txt`:
00320_00091.jpg 10 10 4 6 4 4 7 2 1 6 7 8 5 3
Note: the token code 0 is reserved for the CTC blank token, so character i of the alphabet is encoded as i + 1.
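A minimal sketch of the conversion above, assuming character i of the alphabet maps to token i + 1 (offset by the blank), which matches the example line:

```python
ALPHABET = "0123456789"
# Token 0 is reserved for the CTC blank, so character i maps to i + 1.
encode = {ch: i + 1 for i, ch in enumerate(ALPHABET)}

label = "99353361056742"
codes = " ".join(str(encode[ch]) for ch in label)
print("00320_00091.jpg", codes)
# 00320_00091.jpg 10 10 4 6 4 4 7 2 1 6 7 8 5 3
```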
The full alphabet contains 5989 characters altogether, covering Chinese characters, English letters, digits, and punctuation. It can be downloaded from OneDrive or BaiduYun (pw: d654); put the downloaded file `alphabet_decode_5990.txt` into the directory `[DATASET_ROOT_DIR]`.
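For decoding, the alphabet file can be loaded into an index-to-character table. A minimal sketch, assuming one character per line with index 0 corresponding to the blank token (the actual file layout is not verified here):

```python
with open("alphabet_decode_5990.txt", encoding="utf-8") as f:
    idx_to_char = [line.rstrip("\n") for line in f]  # assumed: index 0 is the blank

def decode(codes):
    """Map a list of token codes back to a string."""
    return "".join(idx_to_char[c] for c in codes)
```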
Due to GPU limitations, I have only trained the CRNN with the `densenet121` architecture for 1 epoch and the `mobilenetv2_cifar` architecture for 2 epochs.
The pre-trained `densenet121` checkpoint can be downloaded from OneDrive or BaiduYun (pw: riuh) (trained for 1 epoch, with accuracy 97.55%), and the pre-trained `mobilenetv2_cifar` checkpoint can be downloaded from OneDrive or BaiduYun (pw: n2rg) (trained for 2 epochs, with accuracy 97.83%).
Training:

```bash
python ./main.py --dataset-root [DATASET_ROOT_DIR] --arch densenet121 \
    --alphabet [DATASET_ROOT_DIR]/alphabet_decode_5990.txt \
    --lr 5e-5 --optimizer rmsprop --gpu-id [GPU-ID] \
    --not-pretrained
```
The initial learning rate for training the `densenet121` architecture is `5e-5`, and the initial learning rate for training the `mobilenetv2_cifar` architecture is `5e-4`.
Use a trained model to test:

```bash
python ./main.py --dataset-root [DATASET_ROOT_DIR] --arch densenet121 \
    --alphabet [DATASET_ROOT_DIR]/alphabet_decode_5990.txt \
    --lr 5e-5 --optimizer rmsprop --gpu-id [GPU-ID] \
    --resume densenet121_pretrained.pth.tar --test-only
```
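The checkpoint can also be inspected outside of `main.py`. A minimal sketch, assuming the `.pth.tar` file stores the weights under a `state_dict` key (a common PyTorch convention, not verified for this repo):

```python
import torch

checkpoint = torch.load("densenet121_pretrained.pth.tar", map_location="cpu")
print(checkpoint.keys())  # e.g. dict_keys(['epoch', 'state_dict', ...])
# model.load_state_dict(checkpoint["state_dict"])  # hypothetical: 'model' is your network
```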