The official code implementation of SemiMTR: Multimodal Semi-Supervised Learning for Text Recognition.
Paper | Pretrained Models | SeqCLR Paper | Citation | Demo
Aviad Aberdam, Roy Ganz, Shai Mazor, Ron Litman
We introduce a multimodal semi-supervised learning algorithm for text recognition, which is customized for modern vision-language multimodal architectures. To this end, we present a unified one-stage pretraining method for the vision model, which suits scene text recognition. In addition, we offer a sequential, character-level consistency regularization in which each modality teaches itself. Extensive experiments demonstrate state-of-the-art performance on multiple scene text recognition benchmarks.
Figure 1: SemiMTR vision model pretraining: Contrastive learning
Figure 2: SemiMTR model fine-tuning: Consistency regularization
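For intuition, the sketch below shows one way the character-level consistency objective of the fine-tuning stage (Figure 2) could be written in PyTorch. It is an illustration under our own assumptions (soft teacher targets, a detached teacher branch), not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def char_consistency_loss(teacher_logits: torch.Tensor,
                          student_logits: torch.Tensor) -> torch.Tensor:
    """Hypothetical character-level consistency term.

    Both inputs have shape (batch, seq_len, num_classes): per-character
    predictions of the same modality on two augmented views of an image.
    The teacher branch is detached, so each modality "teaches itself"
    without gradients flowing into the teacher view.
    """
    teacher_probs = F.softmax(teacher_logits.detach(), dim=-1)  # soft targets
    log_student = F.log_softmax(student_logits, dim=-1)
    # Cross-entropy between teacher and student at every character position
    per_char = -(teacher_probs * log_student).sum(dim=-1)  # (batch, seq_len)
    return per_char.mean()
```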
- Inference and demo require PyTorch >= 1.7.1
- For training and evaluation, install the dependencies:
pip install -r requirements.txt
Download pretrained models:
Pretrained vision models:
Pretrained language model:
To fine-tune SemiMTR without rerunning the vision and language pretraining, place the models above in a `workdir` directory, as follows (a quick way to inspect a downloaded checkpoint is sketched after the tree):
workdir
├── semimtr_vision_model_real_l_and_u.pth
├── abinet_language_model.pth
└── semimtr_real_l_and_u.pth
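To sanity-check a download before training, you can peek inside a checkpoint with plain `torch.load`. The file name below matches the tree above; the internal key layout is an assumption, so the snippet only prints what it finds:

```python
import torch

# Inspect a downloaded checkpoint (file name from the workdir tree above).
state = torch.load('workdir/semimtr_real_l_and_u.pth', map_location='cpu')
print(type(state))
if isinstance(state, dict):
    # Print a few top-level keys; the exact layout (raw state_dict vs. a
    # wrapper with 'model'/'optimizer' entries) is not guaranteed here.
    for key in list(state)[:5]:
        print(key)
```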
Word-level recognition accuracy (%) by training data:

Training Data | IIIT | SVT | IC13 | IC15 | SVTP | CUTE | Avg. | COCO | RCTW | Uber | ArT | LSVT | MLT19 | ReCTS | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Synth (ABINet) | 96.4 | 93.2 | 95.1 | 82.1 | 89.0 | 89.2 | 91.2 | 63.1 | 59.7 | 39.6 | 68.3 | 59.5 | 85.0 | 86.7 | 52.0 |
Real-L+U | 97.0 | 95.8 | 96.1 | 84.7 | 90.7 | 94.1 | 92.8 | 72.2 | 76.1 | 58.5 | 71.6 | 77.1 | 90.4 | 92.4 | 65.4 |
Real-L+U+Synth | 97.4 | 96.8 | 96.5 | 84.7 | 92.9 | 95.1 | 93.3 | 73.0 | 75.7 | 58.6 | 72.4 | 77.5 | 90.4 | 93.1 | 65.8 |
Real-L+U+TextOCR | 97.3 | 97.7 | 96.9 | 86.0 | 92.2 | 94.4 | 93.7 | 73.8 | 77.7 | 58.6 | 73.5 | 78.3 | 91.3 | 93.3 | 66.1 |
- Download the preprocessed LMDB datasets for training and evaluation (a quick way to peek into one is sketched after this list). Link
- To train the language model, download WikiText-103. Link
- The final structure of the `data` directory can be found in DATA.md.
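As a quick sanity check, the sketch below opens one of the downloaded LMDB datasets and reads a single sample. The `num-samples` / `image-%09d` / `label-%09d` key convention is an assumption (it is common in scene-text LMDB datasets, but verify against this repo's dataset code), and `data/some_dataset` is a placeholder path:

```python
import io
import lmdb
from PIL import Image

# Placeholder path; point this at one of the extracted LMDB folders.
env = lmdb.open('data/some_dataset', readonly=True, lock=False)
with env.begin() as txn:
    # Assumed key convention -- check the repo's dataset code if it fails.
    num_samples = int(txn.get(b'num-samples'))
    image = Image.open(io.BytesIO(txn.get(b'image-%09d' % 1)))
    label = txn.get(b'label-%09d' % 1).decode()
print(num_samples, image.size, label)
```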
- Pretrain vision model
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/semimtr_pretrain_vision_model.yaml
- Pretrain language model
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/pretrain_language_model.yaml
- Train SemiMTR
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/semimtr_finetune.yaml
Note:
- You can set the `checkpoint` path for the vision and language models separately to start from specific pretrained models, or set it to `None` to train from scratch (a scripted way to set these paths is sketched below).
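If you prefer to script the config change rather than edit the YAML by hand, here is a hedged sketch. The `model_vision`/`model_language` key names are hypothetical, so check the actual schema in configs/semimtr_finetune.yaml before relying on them:

```python
import yaml  # PyYAML

# Load the fine-tuning config and point it at the downloaded checkpoints.
with open('configs/semimtr_finetune.yaml') as f:
    cfg = yaml.safe_load(f)

# Hypothetical key names -- adapt them to the config's real schema.
cfg.setdefault('model_vision', {})['checkpoint'] = \
    'workdir/semimtr_vision_model_real_l_and_u.pth'
cfg.setdefault('model_language', {})['checkpoint'] = \
    'workdir/abinet_language_model.pth'

# Save a copy so the shipped config stays untouched.
with open('configs/my_semimtr_finetune.yaml', 'w') as f:
    yaml.safe_dump(cfg, f)
```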
- Pretrain vision model
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/abinet_pretrain_vision_model.yaml
- Pretrain language model
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/pretrain_language_model.yaml
- Train ABINet
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/abinet_finetune.yaml
To evaluate a trained model, run:
CUDA_VISIBLE_DEVICES=0 python main.py --config configs/semimtr_finetune.yaml --run_only_test
- `--checkpoint /path/to/checkpoint` sets the path of the evaluation model
- `--test_root /path/to/dataset` sets the path of the evaluation dataset
- `--model_eval [alignment|vision]` selects which sub-model to evaluate
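For example, evaluating the fine-tuned SemiMTR checkpoint from the `workdir` layout above with the alignment sub-model:
CUDA_VISIBLE_DEVICES=0 python main.py --config configs/semimtr_finetune.yaml --run_only_test --checkpoint workdir/semimtr_real_l_and_u.pth --model_eval alignment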
If you find our method useful for your research, please cite:
@article{aberdam2022multimodal,
  title={Multimodal Semi-Supervised Learning for Text Recognition},
  author={Aberdam, Aviad and Ganz, Roy and Mazor, Shai and Litman, Ron},
  journal={arXiv preprint arXiv:2205.03873},
  year={2022}
}

@inproceedings{aberdam2021sequence,
  title={Sequence-to-sequence contrastive learning for text recognition},
  author={Aberdam, Aviad and Litman, Ron and Tsiper, Shahar and Anschel, Oron and Slossberg, Ron and Mazor, Shai and Manmatha, R and Perona, Pietro},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={15302--15312},
  year={2021}
}
This implementation is based on the ABINet repository.
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.
Feel free to contact us with any questions: Aviad Aberdam