Official PyTorch implementation for Character Aware Alignment Contrastive Learning for Chinese Scene Text Recognition (CAAC).
Character indistinguishability, caused by the enormous number of character categories and significant inter-character similarity, is the main challenge in Chinese scene text recognition. In this paper, we explore extracting discriminative Chinese character representations to cope with these difficulties. We propose Character Aware Alignment Contrastive learning (CAAC) for Chinese scene text recognition, which achieves superior performance within an overall concise framework. We leverage the character-aware property of the attentional decoder to instantiate character-level contrastive learning, using more fine-grained atomic elements than previous sub-word-level self-supervised contrastive text recognition methods. In addition, we investigate a projection-free strategy that directly couples the task loss with a supervised contrastive loss, jointly guiding the recognizer toward accurate Chinese character identification while trading off intra- and inter-domain generalization. All the proposed strategies are plug-and-play: we demonstrate that CAAC brings stable performance gains to existing methods and that the projection-free strategy yields better cross-domain generalization than three common projection heads. Extensive experiments have been conducted on multiple text recognition benchmarks, including a self-collected ship license plate dataset, to verify recognition performance, generalization capability, and transferability. The experimental results show that our method outperforms previous methods by 2.81% and 1.34% on the Chinese Scene and Web text recognition datasets, respectively.
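As a rough illustration of the projection-free coupling described above, the sketch below combines a per-character cross-entropy task loss with a SupCon-style supervised contrastive loss computed directly on character-level decoder features, with no projection head in between. This is a minimal sketch, not the repository's code; names such as `char_features`, `char_logits`, and the weight `lam` are illustrative.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """SupCon-style loss over character-level features.

    features: (N, D) character embeddings from the attentional decoder.
    labels:   (N,)   character class indices; same class => positive pair.
    Assumes the batch contains repeated characters (otherwise no positives).
    """
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature            # (N, N) similarities
    # Exclude self-similarity on the diagonal.
    not_self = ~torch.eye(len(labels), dtype=torch.bool, device=features.device)
    pos_mask = (labels[:, None] == labels[None, :]) & not_self
    sim = sim.masked_fill(~not_self, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-likelihood over positives, for anchors that have any.
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(1)[valid] / pos_counts[valid]
    return loss.mean()

# Projection-free coupling: the same character features drive both losses.
# char_logits: (N, num_classes) from the decoder classifier; char_labels: (N,)
# loss = F.cross_entropy(char_logits, char_labels) \
#        + lam * supervised_contrastive_loss(char_features, char_labels)
```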
- Inference requires PyTorch >= 1.7.1.
- For training and evaluation, install the dependencies:

```bash
pip install -r requirements.txt
```
- Download the Scene and Web LMDB datasets for training and evaluation (a minimal reading sketch follows this list).
- For cross-domain generalization analysis, the commonly used English datasets can be downloaded: IIIT5K (IIIT), Street View Text (SVT), ICDAR 2015 (IC15 1811), Street View Text Perspective (SVTP), and WordArt.
- The newly collected SLPR and SLPR-P datasets consist of 6,922 ship license plate text images, with 5,131 training images and 1,791 testing images. The dataset is available at Project.
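For orientation, here is a minimal sketch of iterating over one of these LMDB datasets. It assumes the key layout widely used by scene-text recognition benchmarks (`num-samples`, `image-%09d`, `label-%09d`); please verify the keys against the files you download.

```python
import io
import lmdb
from PIL import Image

def iter_lmdb_samples(lmdb_path):
    """Yield (PIL.Image, label) pairs from a scene-text LMDB dataset."""
    env = lmdb.open(lmdb_path, readonly=True, lock=False, readahead=False)
    with env.begin(write=False) as txn:
        num_samples = int(txn.get(b'num-samples'))
        for i in range(1, num_samples + 1):            # keys are 1-indexed
            img_bytes = txn.get(f'image-{i:09d}'.encode())
            label = txn.get(f'label-{i:09d}'.encode()).decode('utf-8')
            yield Image.open(io.BytesIO(img_bytes)).convert('RGB'), label

# Example: peek at the first sample of a downloaded training set.
# img, label = next(iter_lmdb_samples('data/scene/train'))
```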
- Training

```bash
python main.py --config=configs/xxx.yaml
```

- Evaluation

```bash
python main.py --config=configs/xxx.yaml --phase test --image_only
```
Get the pretrained models from GoogleDrive. The performance of some pretrained models is summarized below; ACC / NED are reported in percentage and decimal format, respectively.
Model | Scene | Web | log (Scene) | log (Web) | SLPR | SLPR-P | log (SLPR) | log (SLPR-P) |
---|---|---|---|---|---|---|---|---|
ResNet-45-CAAC | 65.11 / 0.815 | 63.40 / 0.801 | Link | Link | 92.85 / 0.977 | 86.60 / 0.945 | Link | Link |
ResNet-45-no-CAAC | 64.07 / 0.809 | 62.86 / 0.799 | Link | Link | 92.85 / 0.976 | 88.50 / 0.954 | Link | Link |
Swin-S-CAAC | 74.91 / 0.88 | 64.74 / 0.813 | Link | Link | - | - | - | - |
Swin-S-no-CAAC | 72.93 / 0.873 | 62.64 / 0.800 | Link | Link | - | - | - | - |
Please click the hyperlinks to see the detailed experimental results, which follow the format `[gt] [pt]`.
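For reference, here is a small sketch of how the two metrics in the table are commonly computed in Chinese text recognition benchmarks: ACC is exact-match word accuracy in percent, and NED here denotes the mean of 1 minus the normalized edit distance, so higher is better. This is an illustration under those assumptions, not the repository's evaluation code.

```python
import Levenshtein  # pip install python-Levenshtein

def accuracy_and_ned(gts, pts):
    """Word accuracy (%) and mean (1 - normalized edit distance) over pairs."""
    correct, ned_sum = 0, 0.0
    for gt, pt in zip(gts, pts):
        correct += gt == pt
        denom = max(len(gt), len(pt))
        # Normalize edit distance by the longer string; empty-vs-empty scores 1.
        ned_sum += 1.0 - Levenshtein.distance(gt, pt) / denom if denom else 1.0
    n = len(gts)
    return 100.0 * correct / n, ned_sum / n   # ACC in %, NED as a decimal
```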
This implementation is based on the repositories ABINet, SupContrast, FudanVI/benchmarking-chinese-text-recognition, and WordArt.
Following ABINet, this project is free only for academic research purposes and is licensed under the 2-Clause BSD License - see the LICENSE file for details.