
Primary language: Python. License: MIT.

This project is derived from Donut; please refer to the Donut repository for the original source.

SynthDoG 🐶: Synthetic Document Generator

SynthDoG is a synthetic document generator for visual document understanding (VDU).


Prerequisites

Usage

# Set environment variable (for macOS)
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

synthtiger -o ./outputs/SynthDoG_en -c 50 -w 4 -v template.py SynthDoG config_en.yaml

{'config': 'config_en.yaml',
 'count': 50,
 'name': 'SynthDoG',
 'output': './outputs/SynthDoG_en',
 'script': 'template.py',
 'verbose': True,
 'worker': 4}
{'aspect_ratio': [1, 2],
     .
     .
 'quality': [50, 95],
 'short_size': [720, 1024]}
Generated 1 data (task 3)
Generated 2 data (task 0)
Generated 3 data (task 1)
     .
     .
Generated 49 data (task 48)
Generated 50 data (task 49)
46.32 seconds elapsed
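
For unattended runs it can be useful to check the log against the requested count. The following is a minimal sketch, assuming the progress lines keep the `Generated N data (task K)` shape shown in the sample output; `summarize_log` is a hypothetical helper and not part of synthtiger.

```python
import re

# Matches progress lines such as "Generated 3 data (task 1)".
# Assumption: the log format is the one shown in the sample output.
PROGRESS = re.compile(r"^Generated (\d+) data \(task (\d+)\)$")


def summarize_log(text):
    """Return (number of progress lines, highest running count, set of task ids)."""
    counts, tasks = [], set()
    for line in text.splitlines():
        match = PROGRESS.match(line.strip())
        if match:
            counts.append(int(match.group(1)))
            tasks.add(int(match.group(2)))
    return len(counts), max(counts, default=0), tasks


sample = """Generated 1 data (task 3)
Generated 2 data (task 0)
Generated 3 data (task 1)
"""
lines, highest, tasks = summarize_log(sample)
print(lines, highest, sorted(tasks))
```

If the highest running count is lower than the value passed to `-c`, the run probably stopped early.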

Some important arguments:

  • -o : directory path to save data.
  • -c : number of samples to generate.
  • -w : number of workers.
  • -s : random seed.
  • -v : print error messages.
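
When the generator is driven from a script, the options above can be assembled programmatically. This is only a sketch: `build_synthtiger_command` is a hypothetical helper, and it assumes that `-s` takes an integer value and that options may precede the positional arguments, as in the usage example.

```python
import shlex


def build_synthtiger_command(output, count, workers, config,
                             seed=None, verbose=True,
                             script="template.py", name="SynthDoG"):
    """Assemble a synthtiger argument list from the options listed above."""
    cmd = ["synthtiger", "-o", str(output), "-c", str(count), "-w", str(workers)]
    if seed is not None:
        # Assumption: -s is followed by the seed value.
        cmd += ["-s", str(seed)]
    if verbose:
        cmd.append("-v")
    # Positional arguments: template script, template name, config file.
    cmd += [script, name, config]
    return cmd


cmd = build_synthtiger_command("./outputs/SynthDoG_en", 50, 4, "config_en.yaml")
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Returning a list rather than a string lets the result be passed directly to `subprocess.run` without shell quoting concerns.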

To generate English, Chinese, Japanese, and Korean (ECJK) samples:

# english
synthtiger -o {dataset_path} -c {num_of_data} -w {num_of_workers} -v template.py SynthDoG config_en.yaml

# chinese
synthtiger -o {dataset_path} -c {num_of_data} -w {num_of_workers} -v template.py SynthDoG config_zh.yaml

# japanese
synthtiger -o {dataset_path} -c {num_of_data} -w {num_of_workers} -v template.py SynthDoG config_ja.yaml

# korean
synthtiger -o {dataset_path} -c {num_of_data} -w {num_of_workers} -v template.py SynthDoG config_ko.yaml
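
The four commands above differ only in the config file, so they can be issued from one loop. The sketch below assumes one output directory per language named `SynthDoG_{lang}` (following the `./outputs/SynthDoG_en` path in the usage example); `generate_all` and its `run` parameter are hypothetical names introduced for illustration.

```python
import subprocess

# Language codes taken from the config file names above.
LANGUAGES = ("en", "zh", "ja", "ko")


def generate_all(dataset_root, count, workers, run=subprocess.run):
    """Invoke synthtiger once per language and return the commands issued."""
    issued = []
    for lang in LANGUAGES:
        cmd = [
            "synthtiger",
            "-o", f"{dataset_root}/SynthDoG_{lang}",  # assumed directory layout
            "-c", str(count),
            "-w", str(workers),
            "-v",
            "template.py", "SynthDoG", f"config_{lang}.yaml",
        ]
        run(cmd, check=True)  # subprocess.run raises CalledProcessError on failure
        issued.append(cmd)
    return issued


# Dry run: a stand-in runner that does nothing, so synthtiger is not launched.
dry = generate_all("./outputs", 50, 4, run=lambda cmd, check: None)
for cmd in dry:
    print(" ".join(cmd))
```

Calling `generate_all("./outputs", 50, 4)` without the `run` argument would launch the real command for each language in turn.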

Citation

@inproceedings{kim2022donut,
  title     = {OCR-Free Document Understanding Transformer},
  author    = {Kim, Geewook and Hong, Teakgyu and Yim, Moonbin and Nam, JeongYeon and Park, Jinyoung and Yim, Jinyeong and Hwang, Wonseok and Yun, Sangdoo and Han, Dongyoon and Park, Seunghyun},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2022}
}