Python script for performing various operations on ALTO XML files

Primary language: Python. License: Apache License 2.0 (Apache-2.0)

alto-tools

Python 3 script for performing various operations on ALTO files.

Usage

  • extract UTF-8 text content from ALTO file

    python3 alto_tools.py alto.xml -t

  • extract page OCR confidence score from ALTO file

    python3 alto_tools.py alto.xml -c

  • extract bounding boxes of illustrations from ALTO file

    python3 alto_tools.py alto.xml -l
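
For readers who want to see roughly what these three operations involve, here is a minimal sketch using only the Python standard library. It is not the code of alto_tools.py: the function names (`local_name`, `extract_text`, `mean_word_confidence`, `illustration_boxes`) and the sample document are hypothetical. It assumes that text lives in the `CONTENT` attribute of `String` elements, that a page confidence can be approximated as the mean of the per-word `WC` attributes, and that illustrations carry `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes. The script itself may compute or format these values differently.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix so the same code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_text(root):
    """Return one line of text per TextLine, with words joined by single spaces."""
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [
                child.attrib.get("CONTENT", "")
                for child in elem
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return "\n".join(lines)


def mean_word_confidence(root):
    """Average the WC attribute over all String elements (assumed definition)."""
    values = [
        float(elem.attrib["WC"])
        for elem in root.iter()
        if local_name(elem.tag) == "String" and "WC" in elem.attrib
    ]
    return sum(values) / len(values) if values else None


def illustration_boxes(root):
    """Return (HPOS, VPOS, WIDTH, HEIGHT) for each Illustration that has all four."""
    keys = ("HPOS", "VPOS", "WIDTH", "HEIGHT")
    boxes = []
    for elem in root.iter():
        if local_name(elem.tag) == "Illustration" and all(k in elem.attrib for k in keys):
            boxes.append(tuple(float(elem.attrib[k]) for k in keys))
    return boxes


# Tiny hypothetical ALTO document used only to exercise the functions above.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace>
    <TextBlock>
      <TextLine>
        <String CONTENT="Hello" WC="0.50"/>
        <SP/>
        <String CONTENT="world" WC="1.00"/>
      </TextLine>
    </TextBlock>
    <Illustration ID="i1" HPOS="10" VPOS="20" WIDTH="300" HEIGHT="400"/>
  </PrintSpace></Page></Layout>
</alto>"""

root = ET.fromstring(SAMPLE)
print(extract_text(root))
print(mean_word_confidence(root))
print(illustration_boxes(root))
```

To run the same functions on a real file, replace `ET.fromstring(SAMPLE)` with `ET.parse("alto.xml").getroot()`. Matching on the local tag name rather than a fixed namespace is a deliberate choice here, since the namespace URI differs between ALTO schema versions.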

Planned

  • write output to file(s) - currently all output is sent to stdout

    python3 alto_tools.py alto.xml [OPTION] -o
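
Since all output currently goes to stdout, it can already be captured in a file from the calling side. The following is a minimal sketch of one way to do that from Python. The helper name `run_to_file` and the output file name are hypothetical, and the sketch says nothing about how the planned `-o` option will eventually behave.

```python
import subprocess
import sys
from pathlib import Path


def run_to_file(command, out_path):
    """Run `command` and write everything it prints on stdout to `out_path`."""
    out_path = Path(out_path)
    # Binary mode: the child process writes its bytes straight into the file.
    with out_path.open("wb") as handle:
        subprocess.run(command, stdout=handle, check=True)
    return out_path


# Example call, assuming alto_tools.py and alto.xml are in the current directory:
# run_to_file([sys.executable, "alto_tools.py", "alto.xml", "-t"], "alto.txt")
```

With `check=True`, a non-zero exit status from the script raises `subprocess.CalledProcessError` instead of silently leaving a partial file behind.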