Primary language: Python. License: Creative Commons Zero v1.0 Universal (CC0-1.0).

gt_structure_all

This meta-repository is a comprehensive collection of all official OCR-D Ground Truth repositories with structural annotations (i.e. only layout, but no text).

Together, these datasets make up the OCR-D Structure GT corpus, a total of 25,441 printed pages. Each page consists of an image and its annotation in PAGE format, which captures the structural elements of the page at the region level (segments are regions, not lines).
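As a rough illustration of what region-level annotation means, the following sketch counts the region elements in a PAGE XML document. It is only a minimal example under stated assumptions: the `count_regions` helper and the inline `SAMPLE` document are hypothetical and not part of this repository, and the namespace shown is just one PAGE schema version. The code ignores the namespace so that it does not depend on the version a given file uses.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_regions(page_xml: str) -> Counter:
    """Count region elements (TextRegion, ImageRegion, ...) in a PAGE XML string."""
    root = ET.fromstring(page_xml)
    counts = Counter()
    for elem in root.iter():
        # Drop the "{namespace}" prefix so any PAGE schema version is accepted.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name.endswith("Region"):
            counts[local_name] += 1
    return counts


# Hypothetical minimal PAGE document: regions only, no text lines or text content.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1"><Coords points="10,10 500,10 500,200 10,200"/></TextRegion>
    <TextRegion id="r2"><Coords points="10,220 500,220 500,400 10,400"/></TextRegion>
    <ImageRegion id="r3"><Coords points="10,420 500,420 500,900 10,900"/></ImageRegion>
  </Page>
</PcGts>
"""

if __name__ == "__main__":
    for name, n in sorted(count_regions(SAMPLE).items()):
        print(name, n)
```

To apply this to a file from one of the datasets, read the file's contents and pass them to `count_regions`.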

It was established as part of the DFG-funded project OCR-D.

Data repositories

Cloning the repository with submodules

git clone --recurse-submodules -j8 https://github.com/OCR-D/gt_structure_all.git
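After cloning, the submodules are recorded in the `.gitmodules` file at the repository root. The sketch below shows one way to list them from Python. The `parse_gitmodules` helper is hypothetical, and the submodule names and URLs in `SAMPLE` are placeholders, not the actual contents of this repository.

```python
import re


def parse_gitmodules(text: str) -> dict[str, dict[str, str]]:
    """Parse .gitmodules content into {submodule name: {key: value}}."""
    modules: dict[str, dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        header = re.fullmatch(r'\[submodule "(.+)"\]', line)
        if header:
            current = modules.setdefault(header.group(1), {})
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return modules


# Placeholder content; the real file lists the actual data repositories.
SAMPLE = """
[submodule "data/example_gt_a"]
	path = data/example_gt_a
	url = https://example.org/example_gt_a.git
[submodule "data/example_gt_b"]
	path = data/example_gt_b
	url = https://example.org/example_gt_b.git
"""

if __name__ == "__main__":
    for name, settings in parse_gitmodules(SAMPLE).items():
        print(name, "->", settings["url"])
```

In a real checkout, the text would come from reading `.gitmodules` instead of `SAMPLE`.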

Zenodo

All data records are also published on Zenodo and therefore have a DOI. Whenever changes are made and a new release is created, the respective dataset receives a new DOI.

The OCR-D datasets on Zenodo can be accessed via this search.

Text Data

If you wish to incorporate text data into these structural datasets, please use the datasets from the gt_structure_dtaText repository.