/hocr-detect-columns

Detects columns and connects indented lines in hOCR files

Primary LanguageHTMLMIT LicenseMIT

hocr-detect-columns

Detects columns and connects indented lines in hOCR files. This Node.js module is used in the NYPL's NYC Space/Time Directory project to extract data from digitized New York City directories.

hOCR column detection

Most OCR tools can produce hOCR files — we are using Tesseract.

But how does it work?

First, hocr-detect-columns uses Cheerio to read all pages in the hOCR file (an hOCR file is just an HTML file with special properties).

Per page, the X positions of the bounding boxes of all OCR lines are clustered, using Simple Statistics. A page with n columns should have many lines with bounding boxes on or around n different X values. If clustering finds n clusters containing most of the OCR lines, we can expect the page has n columns.

To connect indented lines with the previous line they belong to, hocr-detect-columns uses a spatial index and tries to find, for each line which doesn't belong to a column, the closest line in the upper-left direction. The algorithms we need are implemented by RBush (spatial index) and rbush-knn (nearest neighbor search). You can read more about spatial search algorithms for JavaScript on Mapbox's blog.

Installation & Usage

Standalone

npm install -g nypl-spacetime/hocr-detect-columns

hocr-detect-columns can produce the following output formats:

  • Log to stdout (default):
hocr-detect-columns /path/to/file.hocr
  • Output JSON:
hocr-detect-columns --mode json /path/to/file.hocr
  • Output NDJSON:
hocr-detect-columns --mode ndjson /path/to/file.hocr
  • Output HTML visualization:
hocr-detect-columns --mode html /path/to/file.hocr

As a Node.js module

npm install --save nypl-spacetime/hocr-detect-columns
const fs = require('fs')
const detectColumns = require('hocr-detect-columns')

const hocr = fs.readFileSync('/path/to/file.hocr', 'utf8')

const config = {}

const pages = detectColumns(hocr, config)

Configuration

You can configure hocr-detect-columns by supplying a JSON configuration object or file:

{
  "columnCount": 2, // Amount of expected columns
  "characterWidth": 25, // Width of character, in pixels
  "minLinesPerColumn": 50 // Minimum expected lines, per column
}

Example

In the directory example, you can find an hOCR file of page 418 of the 1850 city directory, as well as a JSON file and HTML visualization generated by hocr-detect-columns.

To generate these files yourself, run:

hocr-detect-columns --mode json example/example.hocr

Or:

hocr-detect-columns --mode html example/example.hocr

Data

The format of the resulting JSON pages object is as follows:

{
  "config": {},
  "pages": [
    {
      "number": 0,
      "properties": {},
      "lines": [
        {
          "properties": {
            "bbox": [
              
            ],}
          "text": "contents of line"
          "columnIndex": 0,
          "completeText": "contents of line, appended with text from indented next lines"
        },

      ]
    },
    {
      "number": 1,
      "properties": {

      },
      "lines": [
        …
      ]
    },
    …
  ]
}