Generate synthetic PDF files from customizable content while recording element coordinates for OCR object detection. Supported elements include tables, stamps, paragraphs, lists, titles, footnotes, etc., all fully customizable through a config file. Outputs PDF, image, and JSON files with the necessary information. Parallel processing is supported. (More examples can be found under ./outputs.)
- Highly customizable YAML config file
- Parallel execution across multiple processors
- Page elements: title, subtitle, paragraph, table, space, stamp, list, signature, footnote, etc.
- Various outputs
  - pdf (generated pdf file)
  - jpg (optional: with element bounding boxes)
  - json (recording element coordinates)
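When overlaying recorded coordinates on the rendered image, the PDF and image coordinate systems usually need to be reconciled. The sketch below assumes (this is not confirmed by the project) that a box is stored as `(x0, y0, x1, y1)` in PDF points with the origin at the bottom-left, which is how ReportLab draws, and that the page was rasterized at a known DPI. The function name `pdf_box_to_pixels` is hypothetical.

```python
def pdf_box_to_pixels(box, page_height_pt, dpi):
    """Convert a box from PDF space to image pixel space.

    Assumption: box is (x0, y0, x1, y1) in points, bottom-left origin,
    with x0 <= x1 and y0 <= y1. The project's JSON layout may differ.
    """
    scale = dpi / 72.0  # one PDF point is 1/72 inch
    x0, y0, x1, y1 = box
    left = round(x0 * scale)
    right = round(x1 * scale)
    # Images put the origin at the top-left, so the y axis is flipped.
    top = round((page_height_pt - y1) * scale)
    bottom = round((page_height_pt - y0) * scale)
    return left, top, right, bottom
```

For example, a box of `(72, 72, 144, 144)` on a 720 pt tall page rendered at 144 DPI maps to `(144, 1152, 288, 1296)` in pixels.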
- Page:
  - page style
  - page rotation
- Table:
  - nrows
  - ncols
  - alignment
  - space before/after
  - header content: Chinese / English / Decimals
  - table content: Chinese / English / Decimals / Specials
  - fonts / colors
  - with / without gridlines
  - many more ...
- Paragraph:
  - sentence
  - fonts / colors / words
  - alignment
  - line spacing
- Title:
  - fonts / colors / words
  - number of lines
- Subtitle:
  - fonts / colors / words
  - number of lines
- Spacer:
  - spacer height / width
- Signature:
  - number of signatures on top of table with random location
- Stamp:
  - number of stamps on top of table with random location
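The random placement of stamps and signatures over a table can be pictured as sampling a position that keeps the overlay inside the table's bounding box. This is a minimal sketch of that idea, not the project's actual implementation; `random_overlay_box`, its parameters, and the `(x0, y0, x1, y1)` box layout are assumptions made for illustration.

```python
import random


def random_overlay_box(table_box, width, height, rng=None):
    """Pick a random box of the given size that lies inside table_box.

    Assumption: boxes are (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
    """
    if rng is None:
        rng = random.Random()
    x0, y0, x1, y1 = table_box
    if width > x1 - x0 or height > y1 - y0:
        raise ValueError("overlay is larger than the table")
    # Sample the lower-left corner so the whole overlay stays inside.
    left = rng.uniform(x0, x1 - width)
    bottom = rng.uniform(y0, y1 - height)
    return left, bottom, left + width, bottom + height
```

Passing a seeded `random.Random` as `rng` makes the placement reproducible, which helps when the same page has to be regenerated.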
- poppler-utils [linux]
- reportlab==3.5.28 [python]
- pdf2image==1.9.0 [python]
- tqdm==4.35.0 [python]
- matplotlib==3.1.1 [python]
- numpy==1.17.1 [python]
- jsonschema==3.0.2 [python]
- pyyaml==5.3.1 [python]
- opencv==3.4.2 [python]
The fonts below are non-standard, so the bounding boxes extracted for them are not accurate:
- langting-gbk
- Clone the repo to local
- Install required packages via pip
pip install -r requirements.txt
or conda [recommended]
conda install --file requirements.txt -c conda-forge
- Update the config file (config.yaml) to customize the PDF content components, directories, processors, etc.
- Run
python main.py