Update 2019-05-21: I have rewritten this repo based on fairseq; refer to my implementation here. The fairseq text recognition project is still under construction. Suggestions and cooperation on the fairseq text recognition project are very welcome.
This software implements the Convolutional Recurrent Neural Network (CRNN) in PyTorch, as described in the paper:
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition, Baoguang Shi, Xiang Bai, Cong Yao, PAMI 2017 [arXiv]
This code implements the following architectures (selected via `args.arch`):
- DenseNet + CTC loss (`densenet_cifar`, `densenet121` with pre-trained model)
- ResNet + CTC loss (`resnet_cifar`)
- MobileNetV2 + CTC loss (`mobilenetv2_cifar` with pre-trained model)
- ShuffleNetV2 + CTC loss (`shufflenetv2_cifar`)
Remark: the current network architecture only implements a CNN backbone + fully connected (FC) layers + CTC loss. The CNN acts as a subsampling and encoding layer, the FC layer acts as a decoder layer, and the CTC loss aligns the network's per-frame predictions with the label sequence. For more detail, refer to issue #4 and issue #6.
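For illustration, here is a minimal sketch of that structure. This is not the repo's actual module: the tiny backbone, layer sizes, and shapes are assumptions chosen only to show how CNN + FC + CTC loss fit together.

```python
import torch
import torch.nn as nn

class TinyCRNNSketch(nn.Module):
    """Illustrative CNN backbone + FC decoder, trained with CTC loss."""
    def __init__(self, num_classes):
        super().__init__()
        # CNN backbone: subsamples the 32-pixel-high image and encodes features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height; width becomes time steps
        )
        # FC decoder: maps each time step's features to class scores.
        self.fc = nn.Linear(128, num_classes)

    def forward(self, images):
        feats = self.backbone(images)               # (N, C, 1, W')
        feats = feats.squeeze(2).permute(2, 0, 1)   # (T=W', N, C)
        return self.fc(feats).log_softmax(2)        # (T, N, num_classes) for CTC

# CTC loss aligns the per-frame predictions with the label sequence;
# class 0 is the blank token, matching this repo's encoding.
model = TinyCRNNSketch(num_classes=11)              # e.g. 10 digits + blank
images = torch.randn(4, 3, 32, 280)
log_probs = model(images)                           # (T, N, 11)
targets = torch.randint(1, 11, (4, 14))             # encoded labels; 0 reserved for blank
input_lengths = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((4,), 14, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```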
In order to run this toolbox you will need:
- Python3 (tested with Python 3.6+)
- PyTorch deep learning framework (tested with version 1.0.1)
The demo reads an example image and recognizes its text content. See the demo notebook for all the details.
Example image:
Expected output:
-停--下--来--,--看--着--那--些--握--着------ => 停下来,看着那些握着
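The left side is the raw per-frame prediction; CTC decoding collapses it by merging consecutive repeated symbols and then dropping blanks (`-`). A minimal sketch of that greedy collapse:

```python
def ctc_greedy_collapse(frames, blank="-"):
    """Merge consecutive duplicates, then remove blank symbols."""
    out, prev = [], None
    for sym in frames:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

raw = list("-停--下--来--,--看--着--那--些--握--着------")
print(ctc_greedy_collapse(raw))  # 停下来,看着那些握着
```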
- Navigate (`cd`) to the root of the toolbox `[YOUR_CRNN_ROOT]`.
- Resize the height of each image to 32 pixels, keeping the aspect ratio (see the sketch after this list). Following YCG09's SynthText, the image size is 32x280; the original images can be downloaded from BaiduYun (pw: lu7m). Untar them into the directory `[DATASET_ROOT_DIR]/images`.
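A minimal sketch of the resize step using Pillow (the file names here are placeholders, not part of the toolbox):

```python
from PIL import Image

def resize_to_height(path, target_h=32):
    """Resize so the height is 32 px while preserving the aspect ratio."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    new_w = max(1, round(w * target_h / h))
    return img.resize((new_w, target_h), Image.BILINEAR)

resize_to_height("example.jpg").save("example_32.jpg")
```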
Each line in the annotation file has the format:
img_path encode1 encode2 encode3 encode4 encode5 ...
where each `encode` is the integer token code of one character in the sequence.
For example, suppose the task is recognizing digits in an image, so the alphabet is "0123456789". For an image named "00320_00091.jpg" in the folder `[DATA]/images` whose content is "99353361056742", after conversion there should be the following line in `[DATA]/train.txt` or `[DATA]/dev.txt`:
00320_00091.jpg 10 10 4 6 4 4 7 2 1 6 7 8 5 3
Note: the token code 0 is reserved for the CTC blank token, so character i of the alphabet is encoded as i + 1.
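A minimal sketch of the conversion above, assuming character i of the alphabet maps to token i + 1 (offset by the blank), which matches the example line:

```python
ALPHABET = "0123456789"
# Token 0 is reserved for the CTC blank, so character i maps to i + 1.
encode = {ch: i + 1 for i, ch in enumerate(ALPHABET)}

label = "99353361056742"
codes = " ".join(str(encode[ch]) for ch in label)
print("00320_00091.jpg", codes)
# 00320_00091.jpg 10 10 4 6 4 4 7 2 1 6 7 8 5 3
```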
The full alphabet contains 5989 characters altogether, covering Chinese characters, English letters, digits, and punctuation. It can be downloaded from OneDrive or BaiduYun (pw: d654); put the downloaded file `alphabet_decode_5990.txt` into the directory `[DATASET_ROOT_DIR]`.
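For decoding, the alphabet file can be loaded into an index-to-character table. A minimal sketch, assuming one character per line with index 0 corresponding to the blank token (the actual file layout is not verified here):

```python
with open("alphabet_decode_5990.txt", encoding="utf-8") as f:
    idx_to_char = [line.rstrip("\n") for line in f]  # assumed: index 0 is the blank

def decode(codes):
    """Map a list of token codes back to a string."""
    return "".join(idx_to_char[c] for c in codes)
```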
Due to GPU limitations, I have only trained the CRNN with the `densenet121` architecture for 1 epoch and the `mobilenetv2_cifar` architecture for 2 epochs.
The pre-trained `densenet121` checkpoint can be downloaded from OneDrive or BaiduYun (pw: riuh) (trained for 1 epoch, with accuracy 97.55%), and the pre-trained `mobilenetv2_cifar` checkpoint can be downloaded from OneDrive or BaiduYun (pw: n2rg) (trained for 2 epochs, with accuracy 97.83%).
Training:

```bash
python ./main.py --dataset-root [DATASET_ROOT_DIR] --arch densenet121 \
    --alphabet [DATASET_ROOT_DIR]/alphabet_decode_5990.txt \
    --lr 5e-5 --optimizer rmsprop --gpu-id [GPU-ID] \
    --not-pretrained
```
The initial learning rate for training the `densenet121` architecture is `5e-5`, and the initial learning rate for training the `mobilenetv2_cifar` architecture is `5e-4`.
Use a trained model to test:

```bash
python ./main.py --dataset-root [DATASET_ROOT_DIR] --arch densenet121 \
    --alphabet [DATASET_ROOT_DIR]/alphabet_decode_5990.txt \
    --lr 5e-5 --optimizer rmsprop --gpu-id [GPU-ID] \
    --resume densenet121_pretrained.pth.tar --test-only
```
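The checkpoint can also be inspected outside of `main.py`. A minimal sketch, assuming the `.pth.tar` file stores the weights under a `state_dict` key (a common PyTorch convention, not verified for this repo):

```python
import torch

checkpoint = torch.load("densenet121_pretrained.pth.tar", map_location="cpu")
print(checkpoint.keys())  # e.g. dict_keys(['epoch', 'state_dict', ...])
# model.load_state_dict(checkpoint["state_dict"])  # hypothetical: 'model' is your network
```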