/HocrConverter

Create PDFs and plain text from hOCR documents

Primary LanguagePythonMIT LicenseMIT

HocrConverter

Create PDFs and plain text from hOCR documents

Originally from jbrinley See https://github.com/jbrinley/HocrConverter and http://xplus3.net/2009/04/02/convert-hocr-to-pdf/

Changes by C.Holtermann

Original script didn't work for me so I made some changes to make it work for me

My configuration is ocropus 0.7 and tesseract 3.02.02

Included some aspects from the fork of https://github.com/zw/HocrConverter:

Some command line arguments:

  • draw bounding boxes
  • draw text
  • inverse height ( tesseract and ocropus count differently )
  • multiple pages
  • include Images ( from hOCR or via command line )
  • verbosity

For command line parsing and validation I use some external libraries:

  • docopt
  • schema

Like this the script is rather something to understand the concept.

Maybe it's useful for others trying to understand OCR.

Changes by tristelune1

  • this script is for python3
  • text is search recursively in span tags

Work in progress.