
🎰 TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

$$ TextSSR ~ Capability ~ Showcase. $$

📢 News

[2024.12.05] - The training dataset and the generated dataset are released!

[2024.12.04] - We released the latest model and an online demo; check them out on ModelScope.

[2024.12.03] - Our paper is available here.

πŸ“TODOs

  • Provide publicly checkpoints and gradio demo
  • Release TextSSR-benchmark dataset and evaluation code
  • Release processed AnyWord-lmdb dataset
  • Release our scene text synthesis dataset, TextSSR-F
  • Release training and inference code

💎 Visualization

$$ Model ~ Architecture ~ Display. $$

$$ Data ~ Synthesis ~ Pipeline. $$

$$ Results ~ Presentation. $$

🛠️ Installation

Environment Settings

  1. Clone the TextSSR Repository:

    git clone https://github.com/YesianRohn/TextSSR.git
    cd TextSSR
  2. Create a New Environment for TextSSR:

    conda create -n textssr python=3.10
    conda activate textssr
  3. Install Required Dependencies:

    • Install PyTorch, TorchVision, and Torchaudio with the matching CUDA build:
    conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
    • Install the rest of the dependencies listed in the requirements.txt file:
    pip install -r requirements.txt
    • Install our modified diffusers:
    cd diffusers
    pip install -e .
    cd ..
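
After installation, a quick import check (a minimal sanity script, not part of the repo) confirms that CUDA is visible and that the modified diffusers is the copy in use:

```python
# Sanity check: verify the PyTorch/CUDA install and that `diffusers`
# resolves to the locally installed, modified copy (editable install).
import torch
import diffusers

print(torch.__version__, torch.cuda.is_available())
print(diffusers.__version__, diffusers.__file__)  # should point into ./diffusers
```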

Checkpoints/Data Preparation

  1. Data Preparation:

    • You can use the AnyWord-3M dataset provided by AnyText. However, you will need to modify the data-loading code to use AnyWordDataset instead of AnyWordLmdbDataset (a sketch of this swap follows the list below).
    • If you have obtained our AnyWord-lmdb dataset, simply place it in the TextSSR folder.
  2. Font File Preparation:

    • You can either download the Alibaba PuHuiTi font from here (the file should be named AlibabaPuHuiTi-3-85-Bold.ttf) or use your own custom font file.
    • Place your font file in the TextSSR folder (a quick load check is sketched after this list).
  3. Model Preparation:

    • If you want to train the model from scratch, first download the SD2-1 model from Hugging Face (a download sketch follows this list).
    • Place the downloaded model in the model folder.
    • During the training process, you will obtain several model checkpoints. These should be placed sequentially in the model folder as follows:
      • vae_ft (trained VAE model)
      • step1 (trained CDM after step 1)
      • step2 (trained CDM after step 2)
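
If you go the raw AnyWord-3M route, the swap mentioned in step 1 amounts to constructing AnyWordDataset wherever the training scripts currently build AnyWordLmdbDataset. A minimal sketch, assuming both classes live in a `dataset` module and take a data-root argument (both assumptions; check the actual signatures in this repo):

```python
# Hypothetical illustration of the dataset swap described above.
# The module path `dataset` and the `data_root` argument are assumptions;
# match them to the actual class definitions in this repository.
from dataset import AnyWordDataset  # instead of AnyWordLmdbDataset

train_dataset = AnyWordDataset(data_root="path/to/AnyWord-3M")
```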
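Before training, it can also be worth confirming that the font file loads (a quick check, assuming Pillow is among the installed dependencies):

```python
# Optional sanity check that the font file is readable.
from PIL import ImageFont

font = ImageFont.truetype("AlibabaPuHuiTi-3-85-Bold.ttf", size=32)
print(font.getname())  # prints the (family, style) tuple
```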
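For the from-scratch route, the SD2-1 base weights can be fetched with huggingface_hub; the repo id "stabilityai/stable-diffusion-2-1" is the standard Hugging Face one, and the target directory below matches the tree that follows:

```python
# Download the SD2-1 base model into the expected local folder.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="stabilityai/stable-diffusion-2-1",
    local_dir="model/stable-diffusion-v2-1",
)
```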

After the preparations outlined above, you will have the following file structure:

TextSSR/
├── model/
│   ├── stable-diffusion-v2-1/
│   ├── vae_ft/
│   │   └── checkpoint-x/
│   │       ├── vae/
│   │       └── ...
│   ├── step1/
│   │   └── checkpoint-x/
│   │       ├── unet/
│   │       └── ...
│   └── step2/
│       └── checkpoint-x/
│           ├── unet/
│           └── ...
├── AnyWord-lmdb/
│   ├── step1_lmdb/
│   └── step2-lmdb/
├── AlibabaPuHuiTi-3-85-Bold.ttf
└── ... (the same as the GitHub code)
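
Given this layout, the fine-tuned submodules should be loadable with the standard diffusers classes (a hedged sketch; checkpoint-x stands for whichever checkpoint directory you trained, and the subfolder names follow the tree above):

```python
# Sketch: load the fine-tuned VAE and CDM UNet from the layout above.
# "checkpoint-x" is a placeholder for your actual checkpoint directory.
from diffusers import AutoencoderKL, UNet2DConditionModel

vae = AutoencoderKL.from_pretrained("model/vae_ft/checkpoint-x", subfolder="vae")
unet = UNet2DConditionModel.from_pretrained("model/step2/checkpoint-x", subfolder="unet")
```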

🚂 Training

  1. Step 1: Fine-tune the VAE:

    accelerate launch --num_processes 8 train_vae.py --config configs/train_vae_cfg.py
  2. Step 2: First stage of CDM training:

    accelerate launch --num_processes 8 train_diff.py --config configs/train_diff_step1_cfg.py
  3. Step 3: Second stage of CDM training:

    accelerate launch --num_processes 8 train_diff.py --config configs/train_diff_step2_cfg.py
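
The commands above assume an 8-GPU machine; on fewer GPUs the same entry points should work with a smaller --num_processes (run accelerate config first if you have not set up the launcher), e.g. for a single GPU:

    accelerate launch --num_processes 1 train_vae.py --config configs/train_vae_cfg.py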

🔍 Inference

  • Ensure the benchmark path is correctly set in infer.py (see the note below).
  • Run the inference process with:
    python infer.py

This will start the inference and generate the results.
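
The benchmark path lives inside infer.py itself; the variable name in this sketch is purely illustrative (check the top of the script for the actual setting):

```python
# Inside infer.py -- `benchmark_path` is a hypothetical name for whatever
# path setting the script exposes; point it at your benchmark data.
benchmark_path = "path/to/TextSSR-benchmark"
```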

📊 Evaluation

TBD

🔗 Citation

@article{ye2024textssr,
  title={TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition},
  author={Ye, Xingsong and Du, Yongkun and Tao, Yunbo and Chen, Zhineng},
  journal={arXiv preprint arXiv:2412.01137},
  year={2024}
}

🌟 Acknowledgements

Many thanks to these great projects for their contributions, which have influenced and supported our work in various ways: SynthText, TextOCR, DiffUTE, Textdiffuser & Textdiffuser-2, AnyText, UDiffText, SceneVTG, and SVTRv2.

Special thanks also go to the training frameworks: STR-Fewer-Labels and OpenOCR.