This is the PyTorch implementation of the paper SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition (CVPR 2022). The paper is available at this link.
- We use the models pre-trained on ImageNet. The ImageNet pre-trained SwinTransformer backbone is obtained from SwinT_detectron2.
SWINTS-swin-english-pretrain [config] | model_Google Drive | model_BaiduYun PW: 954t
SWINTS-swin-Total-Text [config] | model_Google Drive | model_BaiduYun PW: tf0i
SWINTS-swin-ctw [config] | model_Google Drive | model_BaiduYun PW: 4etq
SWINTS-swin-icdar2015 [config] | model_Google Drive | model_BaiduYun PW: 3n82
SWINTS-swin-ReCTS [config] | model_Google Drive | model_BaiduYun PW: a4be
SWINTS-swin-vintext [config] | model_Google Drive | model_BaiduYun PW: slmp
- Python=3.8
- PyTorch=1.8.0, torchvision=0.9.0, cudatoolkit=11.1
- OpenCV for visualization
- Install the repository (we recommend using Anaconda for installation).
conda create -n SWINTS python=3.8 -y
conda activate SWINTS
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
pip install opencv-python
pip install scipy
pip install shapely
pip install rapidfuzz
pip install timm
pip install Polygon3
git clone https://github.com/mxin262/SwinTextSpotter.git
cd SwinTextSpotter
python setup.py build develop
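After installation, it can help to confirm that the environment matches the versions listed above. The following is a minimal sketch, not part of the repository: the helper names (`installed_version`, `check_versions`) are hypothetical, and the distribution names `torch` and `torchvision` are assumed to be the names under which the packages were installed.

```python
from importlib import metadata

# Versions listed in the requirements above. The distribution names
# ("torch", "torchvision") are an assumption, not stated in this README.
EXPECTED = {"torch": "1.8.0", "torchvision": "0.9.0"}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected, lookup=installed_version):
    """Compare installed versions with the expected ones.

    Returns a dict mapping package name to an (expected, found) tuple for
    every package that is missing or has a different version.
    """
    problems = {}
    for name, wanted in expected.items():
        found = lookup(name)
        # Some builds append a local tag such as "+cu111", so only the
        # part before "+" is compared.
        base = found.split("+")[0] if found else None
        if base != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    for name, (wanted, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

Run inside the activated `SWINTS` environment, the script prints nothing when both versions match.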
- dataset path
datasets
|_ totaltext
| |_ train_images
| |_ test_images
| |_ totaltext_train.json
| |_ weak_voc_new.txt
| |_ weak_voc_pair_list.txt
|_ mlt2017
| |_ train_images
| |_ annotations/icdar_2017_mlt.json
.......
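The layout above can be checked before training with a short script. This is a sketch under stated assumptions: the entry list is copied from the abbreviated tree above and is therefore not exhaustive, and `missing_entries` is a hypothetical helper, not a function provided by the repository.

```python
from pathlib import Path

# Entries copied from the directory tree above; the tree is abbreviated
# in this README, so the list is not exhaustive.
EXPECTED_ENTRIES = [
    "totaltext/train_images",
    "totaltext/test_images",
    "totaltext/totaltext_train.json",
    "totaltext/weak_voc_new.txt",
    "totaltext/weak_voc_pair_list.txt",
    "mlt2017/train_images",
    "mlt2017/annotations/icdar_2017_mlt.json",
]


def missing_entries(root, entries=EXPECTED_ENTRIES):
    """Return the entries that do not exist under the dataset root."""
    root = Path(root)
    return [entry for entry in entries if not (root / entry).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for entry in missing_entries("datasets"):
        print(f"missing: datasets/{entry}")
```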
Download the images:
- ICDAR2017-MLT [image]
- Syntext-150k:
- ICDAR2015 [image]
- ICDAR2013 [image]
- Total-Text_train_images [image]
- Total-Text_test_images [image]
- ReCTS [images&label] PW: 2b4q
- LSVT [images&label] PW: 9uh1
- ArT [images&label] PW: 2865
- SynChinese130k [images][label]
- Vintext_images [image]
Download the labels [Google Drive] [BaiduYun] PW: 46vd
Download the lexicon [Google Drive] and place it in the corresponding dataset directory.
You can also prepare your custom dataset following the example scripts. [example scripts]
To evaluate on Total-Text, CTW1500, and ICDAR2015, first download the zipped annotations:
cd datasets
mkdir evaluation
cd evaluation
wget -O gt_ctw1500.zip https://cloudstor.aarnet.edu.au/plus/s/xU3yeM3GnidiSTr/download
wget -O gt_totaltext.zip https://cloudstor.aarnet.edu.au/plus/s/SFHvin8BLUM4cNd/download
wget -O gt_icdar2015.zip "https://drive.google.com/file/d/1wrq_-qIyb_8dhYVlDzLZTTajQzbic82Z/view?usp=sharing"
wget -O gt_vintext.zip "https://drive.google.com/file/d/11lNH0uKfWJ7Wc74PGshWCOgSxgEnUPEV/view?usp=sharing"
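When a file is fetched with `wget` from a share-page URL, the saved file may be an HTML page rather than the archive itself. A quick sanity check is to confirm that each downloaded file is a valid zip. The sketch below uses the file names from the commands above; `invalid_archives` is a hypothetical helper, not part of the repository.

```python
import zipfile
from pathlib import Path

# File names taken from the wget commands above.
ARCHIVES = [
    "gt_ctw1500.zip",
    "gt_totaltext.zip",
    "gt_icdar2015.zip",
    "gt_vintext.zip",
]


def invalid_archives(directory, names=ARCHIVES):
    """Return the names that are missing or are not valid zip files."""
    directory = Path(directory)
    bad = []
    for name in names:
        path = directory / name
        if not path.is_file() or not zipfile.is_zipfile(path):
            bad.append(name)
    return bad


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for name in invalid_archives("datasets/evaluation"):
        print(f"not a valid zip archive: {name}")
```

Any file reported here can be downloaded again through a browser and saved under the same name.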
- Pretrain SWINTS (e.g., with Swin-Transformer backbone)
python projects/SWINTS/train_net.py \
--num-gpus 8 \
--config-file projects/SWINTS/configs/SWINTS-swin-pretrain.yaml
- Fine-tune model on the mixed real dataset
python projects/SWINTS/train_net.py \
--num-gpus 8 \
--config-file projects/SWINTS/configs/SWINTS-swin-mixtrain.yaml
- Fine-tune the model on a target dataset (e.g., Total-Text)
python projects/SWINTS/train_net.py \
--num-gpus 8 \
--config-file projects/SWINTS/configs/SWINTS-swin-finetune-totaltext.yaml
- Evaluate SWINTS (e.g., with Swin-Transformer backbone)
python projects/SWINTS/train_net.py \
--config-file projects/SWINTS/configs/SWINTS-swin-finetune-totaltext.yaml \
--eval-only MODEL.WEIGHTS ./output/model_final.pth
- Visualize the detection and recognition results (e.g., with Swin-Transformer backbone)
python demo/demo.py \
--config-file projects/SWINTS/configs/SWINTS-swin-finetune-totaltext.yaml \
--input input1.jpg \
--output ./output \
--confidence-threshold 0.4 \
--opts MODEL.WEIGHTS ./output/model_final.pth
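The training and evaluation commands above differ only in the config file and a few flags, so they can be assembled programmatically when scripting a full run. The following is a sketch only: the stage labels and the helpers `train_command` and `eval_command` are hypothetical, while the script path, config names, and flags are copied from the commands above. It does not cover how weights are passed from one stage to the next, since this README does not specify that.

```python
import shlex

TRAIN_SCRIPT = "projects/SWINTS/train_net.py"
CONFIG_DIR = "projects/SWINTS/configs"

# Config names copied from the commands above, in the order they appear.
# The stage labels used as keys are hypothetical.
STAGE_CONFIGS = {
    "pretrain": "SWINTS-swin-pretrain.yaml",
    "mixtrain": "SWINTS-swin-mixtrain.yaml",
    "finetune": "SWINTS-swin-finetune-totaltext.yaml",
}


def train_command(stage, num_gpus=8):
    """Build the argument list for one training stage."""
    config = f"{CONFIG_DIR}/{STAGE_CONFIGS[stage]}"
    return [
        "python", TRAIN_SCRIPT,
        "--num-gpus", str(num_gpus),
        "--config-file", config,
    ]


def eval_command(stage, weights="./output/model_final.pth"):
    """Build the argument list for evaluating a trained checkpoint."""
    config = f"{CONFIG_DIR}/{STAGE_CONFIGS[stage]}"
    return [
        "python", TRAIN_SCRIPT,
        "--config-file", config,
        "--eval-only", "MODEL.WEIGHTS", weights,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed.
    for stage in STAGE_CONFIGS:
        print(shlex.join(train_command(stage)))
    print(shlex.join(eval_command("finetune")))
```

Each returned list can be passed to `subprocess.run(..., check=True)` from the repository root to execute the stage.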
We thank the authors of AdelaiDet, Detectron2, ISTR, SwinT_detectron2, Focal-Transformer and MaskTextSpotterV3.
If our paper helps your research, please cite it in your publications:
@article{huang2022swints,
title = {SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition},
author = {Mingxin Huang and Yuliang Liu and Zhenghao Peng and Chongyu Liu and Dahua Lin and Shenggao Zhu and Nicholas Yuan and Kai Ding and Lianwen Jin},
journal = {arXiv preprint arXiv:2203.10209},
year = {2022}
}
For commercial use, please contact Dr. Lianwen Jin: eelwjin@scut.edu.cn
Copyright 2019, Deep Learning and Vision Computing Lab, South China University of Technology. http://www.dlvc-lab.net