The official code implementation of SemiMTR: Multimodal Semi-Supervised Learning for Text Recognition.
Paper | Pretrained Models | SeqCLR Paper | Citation | Demo
Aviad Aberdam, Roy Ganz, Shai Mazor, Ron Litman
We introduce a multimodal semi-supervised learning algorithm for text recognition, which is customized for modern vision-language multimodal architectures. To this end, we present a unified one-stage pretraining method for the vision model, which suits scene text recognition. In addition, we offer a sequential, character-level consistency regularization in which each modality teaches itself. Extensive experiments demonstrate state-of-the-art performance on multiple scene text recognition benchmarks.
Figure 1: SemiMTR vision model pretraining: Contrastive learning
Figure 2: SemiMTR model fine-tuning: Consistency regularization
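For intuition, the sketch below shows one way the character-level consistency objective of the fine-tuning stage (Figure 2) could be written in PyTorch. It is an illustration under our own assumptions (soft teacher targets, a detached teacher branch), not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def char_consistency_loss(teacher_logits: torch.Tensor,
                          student_logits: torch.Tensor) -> torch.Tensor:
    """Hypothetical character-level consistency term.

    Both inputs have shape (batch, seq_len, num_classes): per-character
    predictions of the same modality on two augmented views of an image.
    The teacher branch is detached, so each modality "teaches itself"
    without gradients flowing into the teacher view.
    """
    teacher_probs = F.softmax(teacher_logits.detach(), dim=-1)  # soft targets
    log_student = F.log_softmax(student_logits, dim=-1)
    # Cross-entropy between teacher and student at every character position
    per_char = -(teacher_probs * log_student).sum(dim=-1)  # (batch, seq_len)
    return per_char.mean()
```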
- Inference and demo require PyTorch >= 1.7.1
- For training and evaluation, install the dependencies:
pip install -r requirements.txt
Download pretrained models:
Pretrained vision models:
Pretrained language model:
To fine-tune SemiMTR without rerunning the vision and language pretraining, place the models above in a `workdir` directory, as follows (a quick way to inspect a downloaded checkpoint is sketched after the tree):
workdir
├── semimtr_vision_model_real_l_and_u.pth
├── abinet_language_model.pth
└── semimtr_real_l_and_u.pth
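To sanity-check a download before training, you can peek inside a checkpoint with plain `torch.load`. The file name below matches the tree above; the internal key layout is an assumption, so the snippet only prints what it finds:

```python
import torch

# Inspect a downloaded checkpoint (file name from the workdir tree above).
state = torch.load('workdir/semimtr_real_l_and_u.pth', map_location='cpu')
print(type(state))
if isinstance(state, dict):
    # Print a few top-level keys; the exact layout (raw state_dict vs. a
    # wrapper with 'model'/'optimizer' entries) is not guaranteed here.
    for key in list(state)[:5]:
        print(key)
```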
Word-level recognition accuracy (%) by training data:

Training Data | IIIT | SVT | IC13 | IC15 | SVTP | CUTE | Avg. | COCO | RCTW | Uber | ArT | LSVT | MLT19 | ReCTS | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Synth (ABINet) | 96.4 | 93.2 | 95.1 | 82.1 | 89.0 | 89.2 | 91.2 | 63.1 | 59.7 | 39.6 | 68.3 | 59.5 | 85.0 | 86.7 | 52.0 |
Real-L+U | 97.0 | 95.8 | 96.1 | 84.7 | 90.7 | 94.1 | 92.8 | 72.2 | 76.1 | 58.5 | 71.6 | 77.1 | 90.4 | 92.4 | 65.4 |
Real-L+U+Synth | 97.4 | 96.8 | 96.5 | 84.7 | 92.9 | 95.1 | 93.3 | 73.0 | 75.7 | 58.6 | 72.4 | 77.5 | 90.4 | 93.1 | 65.8 |
Real-L+U+TextOCR | 97.3 | 97.7 | 96.9 | 86.0 | 92.2 | 94.4 | 93.7 | 73.8 | 77.7 | 58.6 | 73.5 | 78.3 | 91.3 | 93.3 | 66.1 |
- Download the preprocessed LMDB datasets for training and evaluation (a quick way to peek into one is sketched after this list). Link
- To train the language model, download WikiText-103. Link
- The final structure of the `data` directory can be found in DATA.md.
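As a quick sanity check, the sketch below opens one of the downloaded LMDB datasets and reads a single sample. The `num-samples` / `image-%09d` / `label-%09d` key convention is an assumption (it is common in scene-text LMDB datasets, but verify against this repo's dataset code), and `data/some_dataset` is a placeholder path:

```python
import io
import lmdb
from PIL import Image

# Placeholder path; point this at one of the extracted LMDB folders.
env = lmdb.open('data/some_dataset', readonly=True, lock=False)
with env.begin() as txn:
    # Assumed key convention -- check the repo's dataset code if it fails.
    num_samples = int(txn.get(b'num-samples'))
    image = Image.open(io.BytesIO(txn.get(b'image-%09d' % 1)))
    label = txn.get(b'label-%09d' % 1).decode()
print(num_samples, image.size, label)
```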
- Pretrain vision model
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/semimtr_pretrain_vision_model.yaml
- Pretrain language model
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/pretrain_language_model.yaml
- Train SemiMTR
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/semimtr_finetune.yaml
Note:
- You can set the `checkpoint` path for the vision and language models separately to start from specific pretrained models, or set it to `None` to train from scratch (a scripted way to set these paths is sketched below).
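If you prefer to script the config change rather than edit the YAML by hand, here is a hedged sketch. The `model_vision`/`model_language` key names are hypothetical, so check the actual schema in configs/semimtr_finetune.yaml before relying on them:

```python
import yaml  # PyYAML

# Load the fine-tuning config and point it at the downloaded checkpoints.
with open('configs/semimtr_finetune.yaml') as f:
    cfg = yaml.safe_load(f)

# Hypothetical key names -- adapt them to the config's real schema.
cfg.setdefault('model_vision', {})['checkpoint'] = \
    'workdir/semimtr_vision_model_real_l_and_u.pth'
cfg.setdefault('model_language', {})['checkpoint'] = \
    'workdir/abinet_language_model.pth'

# Save a copy so the shipped config stays untouched.
with open('configs/my_semimtr_finetune.yaml', 'w') as f:
    yaml.safe_dump(cfg, f)
```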
- Pretrain vision model
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/abinet_pretrain_vision_model.yaml
- Pretrain language model
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/pretrain_language_model.yaml
- Train ABINet
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/abinet_finetune.yaml
To evaluate a trained model, run:
CUDA_VISIBLE_DEVICES=0 python main.py --config configs/semimtr_finetune.yaml --run_only_test
- `--checkpoint /path/to/checkpoint` sets the path of the evaluation model
- `--test_root /path/to/dataset` sets the path of the evaluation dataset
- `--model_eval [alignment|vision]` selects which sub-model to evaluate
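For example, evaluating the fine-tuned SemiMTR checkpoint from the `workdir` layout above with the alignment sub-model:
CUDA_VISIBLE_DEVICES=0 python main.py --config configs/semimtr_finetune.yaml --run_only_test --checkpoint workdir/semimtr_real_l_and_u.pth --model_eval alignment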
If you find our method useful for your research, please cite:
@article{aberdam2022multimodal,
  title={Multimodal Semi-Supervised Learning for Text Recognition},
  author={Aberdam, Aviad and Ganz, Roy and Mazor, Shai and Litman, Ron},
  journal={arXiv preprint arXiv:2205.03873},
  year={2022}
}

@inproceedings{aberdam2021sequence,
  title={Sequence-to-sequence contrastive learning for text recognition},
  author={Aberdam, Aviad and Litman, Ron and Tsiper, Shahar and Anschel, Oron and Slossberg, Ron and Mazor, Shai and Manmatha, R and Perona, Pietro},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={15302--15312},
  year={2021}
}
This implementation is based on the ABINet repository.
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.
Feel free to contact us with any questions: Aviad Aberdam