Detects columns and indented lines in an hOCR file. This Python 3 script is used in the NYPL's NYC Space/Time Directory project to extract data from digitized city directories.
Most OCR tools can produce hOCR files; this project uses OCRopus. See https://github.com/nypl-spacetime/ocr-scripts for more details.
hocr-detect-columns was built and tested using Python 3.5, and depends on the following packages:
python3 detect_columns.py /path/to/hocr.html
hocr-detect-columns will parse hocr.html and create three files in /path/to:
bboxes.json
lines.txt
visualization.html
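To illustrate the kind of input the script works with, here is a minimal sketch of reading line bounding boxes from an hOCR file and grouping them by their left edge. It is not the code in detect_columns.py: the names LineBoxParser, line_boxes and group_by_left_edge, and the 20-pixel tolerance, are assumptions made for this example. It relies only on the hOCR convention that each line is an element with class "ocr_line" whose title attribute contains "bbox x0 y0 x1 y1".

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. title="bbox 100 50 400 80"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineBoxParser(HTMLParser):
    """Collects the bounding box of every element with class "ocr_line"."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocr_line" in classes and match:
            self.boxes.append(tuple(int(v) for v in match.groups()))


def line_boxes(hocr_text):
    """Return a list of (x0, y0, x1, y1) tuples, one per hOCR line."""
    parser = LineBoxParser()
    parser.feed(hocr_text)
    return parser.boxes


def group_by_left_edge(boxes, tolerance=20):
    """Group boxes whose left edge is within `tolerance` pixels of the
    left edge of the first box in the current group (hypothetical heuristic)."""
    groups = []
    for box in sorted(boxes):
        if groups and box[0] - groups[-1][0][0] <= tolerance:
            groups[-1].append(box)
        else:
            groups.append([box])
    return groups


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], encoding="utf-8") as f:
        for group in group_by_left_edge(line_boxes(f.read())):
            print(group[0][0], len(group))
```

Run against an hOCR file, this prints the left edge and line count of each group. Lines that start slightly to the right of a larger group are candidates for indented lines, and groups far apart are candidates for separate columns. A real directory page would likely need a more robust approach than a fixed tolerance.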
COMING SOON!