Crop, deskew, segment into regions / tables / lines / words, or recognize with tesserocr
This package offers OCR-D compliant workspace processors for (much of) the functionality of Tesseract via its Python API wrapper tesserocr. (Each processor is a parameterizable step in a configurable workflow of the OCR-D functional model. There are usually various alternative processor implementations for each step. Data is represented with METS and PAGE.)
It includes image preprocessing (cropping, binarization, deskewing), layout analysis (region, table, line, word segmentation), script identification, font style recognition and text recognition.
Most processors can operate on different levels of the PAGE hierarchy, depending on the workflow configuration. In PAGE, image results are referenced (read and written) via AlternativeImage, text results via TextEquiv, font attributes via TextStyle, script via @primaryScript, deskewing via @orientation, cropping via Border and segmentation via Region / TextLine / Word elements with Coords/@points.
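To make the PAGE annotation described above more concrete, here is a minimal sketch that reads some of these elements and attributes with Python's standard library. The sample document is made up for illustration, and the summarize helper is hypothetical (it is not part of ocrd_tesserocr or OCR-D core); real workflows would normally go through the OCR-D API instead.

```python
import xml.etree.ElementTree as ET

# PAGE namespace (schema version 2019-07-15); the sample below is invented.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="1000" imageHeight="1500" orientation="0.5">
    <AlternativeImage filename="page.bin.png" comments="binarized"/>
    <Border><Coords points="10,10 990,10 990,1490 10,1490"/></Border>
    <TextRegion id="r1" type="paragraph" primaryScript="Latin">
      <Coords points="50,50 900,50 900,200 50,200"/>
      <TextLine id="r1_l1">
        <Coords points="50,50 900,50 900,100 50,100"/>
        <Word id="r1_l1_w1">
          <Coords points="50,50 200,50 200,100 50,100"/>
          <TextEquiv conf="0.93"><Unicode>Hello</Unicode></TextEquiv>
        </Word>
        <TextEquiv><Unicode>Hello</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn a Coords/@points string like '1,2 3,4' into [(1, 2), (3, 4)]."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def summarize(page_xml):
    """Collect deskewing, image, cropping and word-level results of one page."""
    root = ET.fromstring(page_xml.strip())
    page = root.find("pc:Page", NS)
    summary = {
        "orientation": float(page.get("orientation", 0)),
        "images": [img.get("filename")
                   for img in page.findall("pc:AlternativeImage", NS)],
        "border": parse_points(page.find("pc:Border/pc:Coords", NS).get("points")),
        "words": [],
    }
    for word in page.iterfind(".//pc:Word", NS):
        text = word.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
        coords = parse_points(word.find("pc:Coords", NS).get("points"))
        summary["words"].append((word.get("id"), text, coords))
    return summary


if __name__ == "__main__":
    print(summarize(SAMPLE))
```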
This is the best option if you want to run the software in a container.
You need to have Docker installed:
docker pull ocrd/tesserocr
To run with docker:
docker run -v path/to/workspaces:/data ocrd/tesserocr ocrd-tesserocr-crop ...
This is the best option if you want to use the stable, released version.
NOTE
ocrd_tesserocr requires Tesseract >= 4.1.0. The Tesseract packages bundled with Ubuntu < 19.10 are too old. If you are on Ubuntu 18.04 LTS, please use Alexander Pozdnyakov's PPA repository, which has up-to-date builds of Tesseract and its dependencies:
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt-get install python3 python3-pip libtesseract-dev libleptonica-dev tesseract-ocr wget
pip install ocrd_tesserocr
Use this option if you want to change the source code or install the latest, unpublished changes.
We strongly recommend using venv.
git clone https://github.com/OCR-D/ocrd_tesserocr
cd ocrd_tesserocr
# install Tesseract:
sudo make deps-ubuntu # or manually from git or via ocrd_all
# install tesserocr and ocrd_tesserocr:
make deps # or pip install -r requirements.txt
make install # or pip install .
Tesseract comes with synthetically trained models for languages (tesseract-ocr-{eng,deu,frk,...}) or scripts (tesseract-ocr-script-{latn,frak,...}). In addition, various models trained on scan data are available from the community.
Note that since all OCR-D processors must resolve file/data resources in a standardized way, ocrd-tesserocr-recognize expects the recognition models to be installed in $XDG_DATA_HOME/ocrd-resources/ocrd-tesserocr-recognize (where, usually, $XDG_DATA_HOME=$HOME/.local/share). This is the default resource location used by ocrd resmgr, which you can use to download and list models:
ocrd resmgr --help
(However, for backwards compatibility, this can be overridden by defining $TESSDATA_PREFIX in the environment. In this case, users must install models manually, by linking/copying or downloading them into that directory. The same is true for the non-default location used by the system packages tesseract-ocr-*, which is usually /usr/share/tesseract-ocr/4.00/tessdata.)
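The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not the code ocrd-tesserocr-recognize or ocrd resmgr actually run; the function name model_dir and its parameters are hypothetical.

```python
import os
from pathlib import Path


def model_dir(environ=None, home=None):
    """Return the directory where recognition models are expected.

    Order (as described in the text): $TESSDATA_PREFIX if set, otherwise
    $XDG_DATA_HOME/ocrd-resources/ocrd-tesserocr-recognize, with
    $XDG_DATA_HOME falling back to $HOME/.local/share.
    """
    environ = os.environ if environ is None else environ
    home = Path(home) if home is not None else Path.home()
    override = environ.get("TESSDATA_PREFIX")
    if override:
        # backwards-compatible override: models must be installed manually
        return Path(override)
    xdg_data_home = environ.get("XDG_DATA_HOME") or str(home / ".local" / "share")
    return Path(xdg_data_home) / "ocrd-resources" / "ocrd-tesserocr-recognize"


if __name__ == "__main__":
    print(model_dir())
```

Passing the environment as a dictionary keeps the function easy to try out without touching the real process environment.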
Cf. OCR-D model guide.
Models always use the filename suffix .traineddata, but are just loaded by their basename. You will need at least eng and osd (even for segmentation and deskewing), probably also Latin and Fraktur etc.
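As a small illustration of the naming rule above, the following sketch lists the model names found in a directory and reports which of the minimally required ones (eng and osd) are missing. The helper names are hypothetical; ocrd resmgr remains the supported way to list and download models.

```python
from pathlib import Path

SUFFIX = ".traineddata"
REQUIRED = ("eng", "osd")  # needed even for segmentation and deskewing


def installed_models(directory):
    """Return model names, i.e. file basenames without the .traineddata suffix."""
    return sorted(path.name[:-len(SUFFIX)]
                  for path in Path(directory).glob("*" + SUFFIX))


def missing_required(directory, required=REQUIRED):
    """Return the required model names that are not installed in directory."""
    have = set(installed_models(directory))
    return [name for name in required if name not in have]
```

For example, a directory containing eng.traineddata and Fraktur.traineddata would yield the model names eng and Fraktur, with osd reported as missing.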
As of v0.13.1, you can configure ocrd-tesserocr-recognize to select models dynamically segment by segment, either via custom conditions on the PAGE-XML annotation (presented as XPath rules), or by automatically choosing the model with highest confidence.
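The second strategy (choosing the model with highest confidence) boils down to a simple maximum over candidate results. The sketch below only illustrates that idea under the assumption that each candidate model has already produced a text and a confidence for the segment; the function pick_best and the input format are hypothetical and do not reflect the processor's internals.

```python
def pick_best(results):
    """Pick the candidate with the highest confidence.

    results maps a model name to a (text, confidence) pair for one segment.
    Returns (model, text, confidence) of the winner.
    """
    if not results:
        raise ValueError("no candidate results")
    model = max(results, key=lambda name: results[name][1])
    text, confidence = results[model]
    return model, text, confidence


if __name__ == "__main__":
    # invented candidates for one text line
    print(pick_best({"Fraktur": ("Haus", 0.91), "Latin": ("Hans", 0.74)}))
```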
For details, see docstrings in the individual processors and ocrd-tool.json descriptions, or simply --help.
Available OCR-D processors are:
- ocrd-tesserocr-crop (simplistic)
  - sets Border of pages and adds AlternativeImage files to the output fileGrp
- ocrd-tesserocr-deskew (for skew and orientation; mind operation_level)
  - sets @orientation of regions or pages and adds AlternativeImage files to the output fileGrp
- ocrd-tesserocr-binarize (Otsu – not recommended)
  - adds AlternativeImage files to the output fileGrp
- ocrd-tesserocr-recognize (optionally including segmentation; mind segmentation_level and textequiv_level)
  - adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions, ReadingOrder and AlternativeImage to Page and sets their @orientation (optionally)
  - adds TextRegions to TableRegions and sets their @orientation (optionally)
  - adds TextLines to TextRegions (optionally)
  - adds Words to TextLines (optionally)
  - adds Glyphs to Words (optionally)
  - adds TextEquiv
- ocrd-tesserocr-segment (all-in-one segmentation – recommended; delegates to recognize)
  - adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions, ReadingOrder and AlternativeImage to Page and sets their @orientation
  - adds TextRegions to TableRegions and sets their @orientation
  - adds TextLines to TextRegions
  - adds Words to TextLines
  - adds Glyphs to Words
- ocrd-tesserocr-segment-region (only regions – with overlapping bboxes; delegates to recognize)
  - adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions and ReadingOrder to Page and sets their @orientation
- ocrd-tesserocr-segment-table (only table cells; delegates to recognize)
  - adds TextRegions to TableRegions
- ocrd-tesserocr-segment-line (only lines – from overlapping regions; delegates to recognize)
  - adds TextLines to TextRegions
- ocrd-tesserocr-segment-word (only words; delegates to recognize)
  - adds Words to TextLines
- ocrd-tesserocr-fontshape (only text style – via Tesseract 3 models)
  - adds TextStyle to Words
The text region @types detected are (from Tesseract's PolyBlockType):
- paragraph: normal block (aligned with others in the column)
- floating: unaligned block (is in a cross-column pull-out region)
- heading: block that spans more than one column
- caption: block for text that belongs to an image
If you are unhappy with these choices, consider post-processing with a dedicated custom processor in Python, or by modifying the PAGE files directly (e.g. xmlstarlet ed --inplace -u '//pc:TextRegion/@type[.="floating"]' -v paragraph filegrp/*.xml).
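For the Python route mentioned above, a minimal sketch with the standard library could look as follows. It performs the same floating-to-paragraph rewrite on a PAGE document given as a string. The function retype_regions is hypothetical, and the namespace assumes the 2019-07-15 PAGE schema version; a proper OCR-D processor would also record the change in the METS and PAGE metadata, which this sketch does not do.

```python
import xml.etree.ElementTree as ET

# Assumption: files use the 2019-07-15 version of the PAGE namespace.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def retype_regions(page_xml, old="floating", new="paragraph", ns=PAGE_NS):
    """Replace TextRegion/@type == old by new; return (xml_string, count)."""
    ET.register_namespace("", ns)  # serialize PAGE as the default namespace
    root = ET.fromstring(page_xml)
    changed = 0
    for region in root.iter("{%s}TextRegion" % ns):
        if region.get("type") == old:
            region.set("type", new)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed
```

To apply it to files, read each PAGE file as text, call retype_regions, and write the result back only if the returned count is non-zero.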
All segmentation is currently done as bounding boxes only by default, i.e. without precise polygonal outlines. For dense page layouts this means that neighbouring regions and neighbouring text lines may overlap a lot. If this is a problem for your workflow, try post-processing like so:
- after line segmentation: use ocrd-cis-ocropy-resegment for polygonalization, or ocrd-cis-ocropy-clip on the line level
- after region segmentation: use ocrd-segment-repair with plausibilize (and sanitize after line segmentation)
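To judge whether overlapping bounding boxes are a problem on your data, you can measure the overlap between neighbouring segments directly from their Coords/@points strings. The following is a minimal sketch with hypothetical helper names; the coordinates in the usage example are invented.

```python
def bbox(points):
    """Bounding box (x0, y0, x1, y1) of a Coords/@points string like '1,2 3,4'."""
    xs, ys = [], []
    for pair in points.split():
        x, y = pair.split(",")
        xs.append(int(x))
        ys.append(int(y))
    return min(xs), min(ys), max(xs), max(ys)


def overlap_area(a, b):
    """Area of the intersection of two bounding boxes (0 if they are disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


if __name__ == "__main__":
    line1 = bbox("0,0 100,0 100,50 0,50")
    line2 = bbox("80,40 200,40 200,90 80,90")
    print(overlap_area(line1, line2))
```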
It also means that Tesseract should be allowed to segment across multiple hierarchy levels at once, to avoid introducing inconsistent/duplicate text line assignments in text regions, or word assignments in text lines. Hence,
- prefer ocrd-tesserocr-recognize with segmentation_level=region over ocrd-tesserocr-segment followed by ocrd-tesserocr-recognize, if you want to do all in one with Tesseract,
- prefer ocrd-tesserocr-recognize with segmentation_level=line over ocrd-tesserocr-segment-line followed by ocrd-tesserocr-recognize, if you want to do everything but region segmentation with Tesseract,
- prefer ocrd-tesserocr-segment over ocrd-tesserocr-segment-region followed by (ocrd-tesserocr-segment-table and) ocrd-tesserocr-segment-line, if you want to do everything but recognition with Tesseract.
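The three recommendations above can be written down as processor calls. The sketch below only builds the command lines (it does not run anything), using the usual OCR-D processor options -I, -O and -P. The fileGrp names, the textequiv_level value and the model name Fraktur are assumptions chosen for illustration; check --help for the parameters your installed version supports.

```python
import shlex


def processor_cmd(executable, input_grp, output_grp, **params):
    """Build an OCR-D processor command line as a list of arguments."""
    cmd = [executable, "-I", input_grp, "-O", output_grp]
    for key, value in params.items():
        cmd += ["-P", key, str(value)]  # one parameter override per -P
    return cmd


# all in one with Tesseract: segment from region level down, then recognize
all_in_one = processor_cmd(
    "ocrd-tesserocr-recognize", "OCR-D-IMG", "OCR-D-OCR",
    segmentation_level="region", textequiv_level="word", model="Fraktur")

# everything but region segmentation: regions come from another processor
from_lines = processor_cmd(
    "ocrd-tesserocr-recognize", "OCR-D-SEG-REGION", "OCR-D-OCR",
    segmentation_level="line", textequiv_level="word", model="Fraktur")

# everything but recognition: all-in-one segmentation only
segment_only = processor_cmd("ocrd-tesserocr-segment", "OCR-D-IMG", "OCR-D-SEG")

if __name__ == "__main__":
    for cmd in (all_in_one, from_lines, segment_only):
        print(shlex.join(cmd))
```

Each list can be passed to subprocess.run from inside a workspace directory, or printed and pasted into a shell.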
However, you can also run ocrd-tesserocr-segment* and ocrd-tesserocr-recognize with shrink_polygons=True to get polygons by post-processing each segment, shrinking to the convex hull of all its symbol outlines.
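To illustrate what shrinking to the convex hull means geometrically, here is a self-contained sketch using Andrew's monotone chain algorithm on the corner points of two invented symbol boxes. It is not the processor's implementation (which may well rely on a geometry library); the function names are hypothetical.

```python
def convex_hull(points):
    """Convex hull of 2D points (Andrew's monotone chain).

    Returns the hull vertices in a consistent rotational order, starting
    at the leftmost point, without repeating the first vertex.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z component of (a - o) x (b - o); its sign gives the turn direction
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(sequence):
        chain = []
        for p in sequence:
            # drop points that would make a non-convex (or straight) turn
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]  # last point is the first of the other half

    return half_hull(pts) + half_hull(reversed(pts))


def to_points(polygon):
    """Format a polygon as a Coords/@points string."""
    return " ".join("%d,%d" % (x, y) for x, y in polygon)


if __name__ == "__main__":
    # corner points of two symbol boxes in one word (invented coordinates)
    symbols = [(0, 0), (10, 0), (10, 10), (0, 10),
               (20, 2), (30, 2), (30, 8), (20, 8)]
    print(to_points(convex_hull(symbols)))
```

In this example the two inner corners of the second box fall inside the hull, so the resulting polygon has six vertices and is tighter than the bounding box of the whole word.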
make test
This downloads some test data from https://github.com/OCR-D/assets under repo/assets, and runs some basic tests of the Python API as well as the CLIs.
Set PYTEST_ARGS="-s --verbose" to see log output (-s) and individual test results (--verbose).