What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels
Official PyTorch implementation of STR-Fewer-Labels | paper | training and evaluation data | pretrained model |
Jeonghun Baek, Yusuke Matsui, Kiyoharu Aizawa
The University of Tokyo.
-
In the STR field, real data was scarce (fewer than 10K images).
→ There is implicit common knowledge: “We should use synthetic data, since we don't have enough real data to train models sufficiently.”
→ All state-of-the-art (SOTA) methods use large synthetic data (16M).
→ Problem: training STR with fewer real labels has not been studied enough. For details, such as why this is a problem, please refer to our paper and supplement.
-
We disprove the common knowledge by consolidating recently accumulated public real data and showing that we can train STR models sufficiently with fewer real labels (276K = 1.7% of large synthetic data 16M). In our work, “sufficiently trained” means that the model has accuracy similar to that of the model trained on large synthetic data, as shown in the figure below.
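As a quick sanity check of the ratio quoted above, the arithmetic can be reproduced in a few lines. The variable names are illustrative; only the two counts (276K and 16M) come from the text.

```python
# Counts quoted in the paragraph above.
real_labels = 276_000          # consolidated real labeled images (276K)
synthetic_labels = 16_000_000  # large synthetic data (16M)

# Share of the synthetic data volume that the real labels represent.
ratio_percent = 100 * real_labels / synthetic_labels
print(f"{ratio_percent:.1f}%")  # prints 1.7%
```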
-
Subsequently, as a baseline study of STR with fewer labels, we apply simple data augmentations and semi- and self-supervised learning methods. As a result, we obtain a competitive model with only real data, which has better accuracy than the model trained on large synthetic data and similar accuracy to other SOTA methods that use large synthetic data. (see Table 2 in our paper)
This work is a stepping stone toward STR with fewer labels, and we hope this work will facilitate future work on this topic.
- Jun 5, 2021: Initial upload
- Mar 1, 2021: The paper is accepted at CVPR2021.
-
This work was tested with PyTorch 1.6.0, CUDA 10.1 and Python 3.6.
-
Requirements: lmdb, pillow, torchvision, nltk, natsort, fire, tensorboard, tqdm
pip3 install lmdb pillow torchvision nltk natsort fire tensorboard tqdm
See data.md
-
Download pretrained model
There are 2 models (CRNN or TRBA) and 5 different settings of each model.
Setting | Description |
---|---|
Baseline-synth | Model trained on 2 synthetic datasets (MJSynth + SynthText) |
Baseline-real | Model trained on 11 real datasets (Real-L in Table 1 of our paper) |
Aug | Best augmentation setting in our experiments |
PL | Combination of Aug and Pseudo-Label (PL) |
PR | Combination of Aug, PL and RotNet |
-
Add the image files you want to test to demo_image/
-
Run demo.py
CUDA_VISIBLE_DEVICES=0 python3 demo.py --model_name TRBA --image_folder demo_image/ \
--saved_model TRBA-Baseline-real.pth
demo images | TRBA-Baseline-synth | TRBA-Baseline-real |
---|---|---|
(image) | (ccaloola. | Coca-Cola |
(image) | Line | Hire |
(image) | Lugh | Laugh |
(image) | Gaf | Cafe |
(image) | upbege | BARBREQUE |
(image) | PEOPLE | PEORLE |
(image) | Esciting | ExCiting |
(image) | Signs | Signs |
(image) | BALLY | BALLYS |
(image) | SHAKESHACK | SHAKE SHACK |
-
Train CRNN model with only real data.
CUDA_VISIBLE_DEVICES=0 python3 train.py --model_name CRNN --exp_name CRNN_real
-
Train CRNN with augmentation (for TRBA, use --Aug Blur5-Crop99).
CUDA_VISIBLE_DEVICES=0 python3 train.py --model_name CRNN --exp_name CRNN_aug --Aug Crop90-Rot15
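The --Aug values used here (Crop90-Rot15, Blur5-Crop99) look like hyphen-joined pairs of an augmentation name and a number. The sketch below shows one way such a string could be decomposed. It is an illustration only: parse_aug is a hypothetical helper, not a function from train.py, and this README does not state how the numbers are interpreted.

```python
import re
from typing import List, Tuple


def parse_aug(aug: str) -> List[Tuple[str, int]]:
    """Split a value such as 'Crop90-Rot15' into (name, number) pairs.

    Hypothetical helper for illustration; the real parsing in train.py
    may differ.
    """
    parts = []
    for token in aug.split("-"):
        # Each token is assumed to be letters followed by digits.
        match = re.fullmatch(r"([A-Za-z]+)(\d+)", token)
        if match is None:
            raise ValueError(f"unrecognized augmentation token: {token!r}")
        parts.append((match.group(1), int(match.group(2))))
    return parts


print(parse_aug("Crop90-Rot15"))  # [('Crop', 90), ('Rot', 15)]
print(parse_aug("Blur5-Crop99"))  # [('Blur', 5), ('Crop', 99)]
```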
-
Train CRNN with the semi-supervised method Pseudo-Label (PL).
CUDA_VISIBLE_DEVICES=0 python3 train.py --model_name CRNN --exp_name CRNN_PL --Aug Crop90-Rot15 \
--semi Pseudo --model_for_PseudoLabel saved_models/CRNN_aug/best_score.pth
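The command above passes a previously trained model via --model_for_PseudoLabel. As a rough sketch of the general pseudo-labeling idea (not the code in train.py), a trained model predicts text for unlabeled images, and those predictions are then used as labels alongside the real ones. All function names below are hypothetical, and fake_predict is only a stand-in for a real recognizer.

```python
from typing import Callable, Iterable, List, Tuple

Sample = Tuple[str, str]  # (image path, text label)


def make_pseudo_labels(predict: Callable[[str], str],
                       unlabeled_paths: Iterable[str]) -> List[Sample]:
    """Label each unlabeled image with the trained model's prediction."""
    return [(path, predict(path)) for path in unlabeled_paths]


def build_training_set(labeled: Iterable[Sample],
                       pseudo_labeled: Iterable[Sample]) -> List[Sample]:
    """Combine real labels and pseudo-labels into one training list."""
    return list(labeled) + list(pseudo_labeled)


def fake_predict(path: str) -> str:
    # Stand-in for a trained recognizer; it always answers "text".
    return "text"


labeled = [("img_1.png", "Laugh")]
pseudo = make_pseudo_labels(fake_predict, ["img_2.png", "img_3.png"])
train_set = build_training_set(labeled, pseudo)
print(len(train_set))  # 3
```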
-
Pretrain with RotNet (for TRBA, use --model_name NR).
CUDA_VISIBLE_DEVICES=0 python3 pretrain.py --model_name NV --exp_name NV_Pretrain_RotNet --self RotNet
Train CRNN with RotNet initialization
CUDA_VISIBLE_DEVICES=0 python3 train.py --model_name CRNN --exp_name CRNN_NVInitRotNet \
--saved_model saved_models/NV_Pretrain_RotNet/best_score.pth --Aug Crop90-Rot15
-
Train with PL + RotNet (PR).
CUDA_VISIBLE_DEVICES=0 python3 train.py --model_name CRNN --exp_name CRNN_PR \
--saved_model saved_models/NV_Pretrain_RotNet/best_score.pth --Aug Crop90-Rot15 \
--semi Pseudo --model_for_PseudoLabel saved_models/CRNN_NVInitRotNet/best_score.pth
Try our most accurate model, TRBA_PR, by replacing CRNN with TRBA and --Aug Crop90-Rot15 with --Aug Blur5-Crop99.
Test CRNN model.
CUDA_VISIBLE_DEVICES=0 python3 test.py --eval_type benchmark --model_name CRNN \
--saved_model saved_models/CRNN_real/best_score.pth
train.py (by default, evaluates the trained model on 6 benchmark datasets at the end of training)
--train_data: folder path to training lmdb dataset. default: data_CVPR2021/training/label/
--valid_data: folder path to validation lmdb dataset. default: data_CVPR2021/validation/
--select_data: select training data. default is 'label', which means 11 real labeled datasets.
--batch_ratio: assign ratio for each selected data in the batch. default is '1 / number of datasets'.
--model_name: select model 'CRNN' or 'TRBA'.
--Aug: whether to use augmentation |None|Blur|Crop|Rot|
--semi: whether to use semi-supervised learning |None|PL|MT|
--saved_model: assign saved model to use pretrained model such as RotNet and MoCo.
--self_pre: whether to use self-supervised pretrained model |RotNet|MoCo|. default: RotNet
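To make the --batch_ratio default ('1 / number of datasets') concrete, here is a minimal sketch of how per-dataset ratios could translate into per-dataset sample counts within one batch. It assumes each ratio is multiplied by the total batch size and rounded, with at least one sample per dataset. That rule, the split_batch helper, the dataset names, and the batch size of 128 are all assumptions for illustration, not values taken from train.py.

```python
from typing import Dict, List, Optional


def split_batch(batch_size: int,
                dataset_names: List[str],
                batch_ratio: Optional[List[float]] = None) -> Dict[str, int]:
    """Return how many samples each dataset contributes to one batch."""
    if batch_ratio is None:
        # Default described above: 1 / number of datasets for every dataset.
        batch_ratio = [1 / len(dataset_names)] * len(dataset_names)
    if len(batch_ratio) != len(dataset_names):
        raise ValueError("need exactly one ratio per selected dataset")
    # Assumption: round to the nearest integer and keep at least one sample.
    return {name: max(round(batch_size * ratio), 1)
            for name, ratio in zip(dataset_names, batch_ratio)}


# Hypothetical dataset names and batch size, for illustration only.
equal = split_batch(128, ["set_a", "set_b", "set_c", "set_d"])
weighted = split_batch(128, ["set_a", "set_b"], [0.75, 0.25])
print(equal)     # {'set_a': 32, 'set_b': 32, 'set_c': 32, 'set_d': 32}
print(weighted)  # {'set_a': 96, 'set_b': 32}
```

Because of rounding, the per-dataset counts need not sum exactly to the requested batch size when the ratios do not divide it evenly.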
pretrain.py
--train_data: folder path to training lmdb dataset. default: data_CVPR2021/training/unlabel/
--valid_data: folder path to validation lmdb dataset. default: data_CVPR2021/validation/
--select_data: select training data. default is 'unlabel', which means 3 real unlabeled datasets.
--model_name: select model 'NV' for CRNN. 'NR' or 'TR' for TRBA.
--self: whether to use self-supervised learning |RotNet|MoCo|
test.py
--eval_data: folder path to evaluation lmdb dataset. By default, when you use eval_type, this will be set to data_CVPR2021/evaluation/benchmark/ or data_CVPR2021/evaluation/addition/
--eval_type: select 'benchmark' to evaluate 6 evaluation datasets. select 'addition' to evaluate 7 additionally collected datasets (used in Table 6 in our supplementary material).
--model_name: select model 'CRNN' or 'TRBA'.
--saved_model: assign saved model to evaluation.
demo.py
--image_folder: path to image_folder which contains text images. default: demo_image/
--model_name: select model 'CRNN' or 'TRBA'.
--saved_model: assign saved model to use.
-
Create your own lmdb dataset. You may need pip3 install opencv-python to import cv2.
python3 create_lmdb_dataset.py --inputPath data/ --gtFile data/gt.txt --outputPath result/
At this time, gt.txt should be {imagepath}\t{label}\n
For example
test/word_1.png Tiredness
test/word_2.png kills
test/word_3.png A
...
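The gt.txt format described above ({imagepath}\t{label}\n) can be produced and checked with a short script. This is a minimal sketch: write_gt_file and read_gt_file are hypothetical helpers, not part of create_lmdb_dataset.py, and the sample paths and labels are the ones from the example.

```python
import os
import tempfile
from typing import Iterable, List, Tuple


def write_gt_file(gt_path: str, samples: Iterable[Tuple[str, str]]) -> None:
    """Write one '{imagepath}\\t{label}\\n' line per sample."""
    with open(gt_path, "w", encoding="utf-8") as f:
        for image_path, label in samples:
            f.write(f"{image_path}\t{label}\n")


def read_gt_file(gt_path: str) -> List[Tuple[str, str]]:
    """Read the file back, splitting each line at the first tab only."""
    samples = []
    with open(gt_path, encoding="utf-8") as f:
        for line in f:
            image_path, label = line.rstrip("\n").split("\t", 1)
            samples.append((image_path, label))
    return samples


samples = [
    ("test/word_1.png", "Tiredness"),
    ("test/word_2.png", "kills"),
    ("test/word_3.png", "A"),
]

# Write to a temporary folder so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    gt_path = os.path.join(tmp, "gt.txt")
    write_gt_file(gt_path, samples)
    round_trip = read_gt_file(gt_path)

print(round_trip == samples)  # True
```

Splitting at the first tab only means labels that themselves contain spaces are kept whole.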
-
Modify --select_data, --batch_ratio, and opt.character; see this issue.
This implementation has been based on the repository deep-text-recognition-benchmark.
Please consider citing this work in your publications if it helps your research.
@inproceedings{baek2021STRfewerlabels,
title={What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels},
author={Baek, Jeonghun and Matsui, Yusuke and Aizawa, Kiyoharu},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2021}
}
Feel free to contact us if you have any questions: Jeonghun Baek ku21fang@gmail.com
For code: MIT. For preprocessed datasets: check the license of each dataset in data.md