dbmdz/solr-ocrhighlighting

Add support for snippets crossing page boundaries

jbaiter opened this issue · 0 comments

Currently we always end a snippet once it hits the end of a page block. This is a reasonable thing to do for a lot of OCR documents, since the first line of the new page will likely not be a continuation of the last line of the previous page, but rather a header line or a page number.

However, for users who know that the text in their documents runs continuously from one page to the next, we should keep open the possibility of generating multi-page snippets.

The way to go about this would be to allow limitIter = null in the ContextBreakIterator class (i.e. hl.ocr.limitBlock is set to the empty string), so that passage generation is no longer terminated early at a block boundary.
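To make the intended behaviour concrete, here is a minimal, language-agnostic sketch in Python. It is not the plugin's Java code: the function `passage_end` and its parameters are hypothetical, and break positions are modelled as plain character offsets. It only illustrates the idea that a missing limit (here `None`, standing in for limitIter = null) lets a passage keep growing past a page boundary.

```python
from typing import Optional, Sequence


def passage_end(
    match_pos: int,
    context_breaks: Sequence[int],
    limit_breaks: Optional[Sequence[int]],
    num_blocks: int,
) -> int:
    """Return the end offset of a passage growing forward from match_pos.

    context_breaks: sorted offsets where context blocks (e.g. lines) end.
    limit_breaks: sorted offsets where limit blocks (e.g. pages) end, or
        None to disable the limit (hypothetical stand-in for an empty
        hl.ocr.limitBlock).
    num_blocks: how many context-block boundaries to advance at most.
    """
    # First limit boundary after the match, if limiting is enabled at all.
    limit = None
    if limit_breaks is not None:
        limit = next((b for b in limit_breaks if b > match_pos), None)

    end = match_pos
    taken = 0
    for brk in context_breaks:
        if brk <= match_pos:
            continue  # boundary lies before the match, irrelevant here
        if limit is not None and brk > limit:
            # The next context block would cross the limit block: stop there.
            return limit
        end = brk
        taken += 1
        if taken == num_blocks:
            break
    return end


lines = [10, 20, 30, 40, 50]   # line ends
pages = [30, 60]               # page ends
print(passage_end(25, lines, pages, 2))  # 30: cut off at the page end
print(passage_end(25, lines, None, 2))   # 40: continues onto the next page
```

The same logic would apply in the backward direction for the start of the passage.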

This would require changing the API: a snippet could then consist of multiple regions, each with its own page identifier and coordinates. Highlights would also have to be extended with a reference to the page they are on:

{
  "text": "Some text that crosses a <em>page boundary</em>",
  "score": 881062.75,
  "regions": [
    {"page_id": "page-1", "ulx": 194, "uly": 807, "lrx": 1196, "lry": 1008 },
    {"page_id": "page-2", "ulx": 13, "uly": 64, "lrx": 1012, "lry": 128 }
  ],
  "highlights": [
    [ { "page": "page-1",
        "text": "page",
        "ulx": 694,
        "uly": 82,
        "lrx": 823,
        "lry": 111 } ],
    [ { "page": "page-2",
        "text": "boundary",
        "ulx": 450,
        "uly": 162,
        "lrx": 563,
        "lry": 190 } ]
  ]
}
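As a rough illustration of how a client might consume this proposed format, the following Python sketch groups the regions and highlight boxes of one snippet by page, so that each page image can be annotated separately. The helper `boxes_by_page` is hypothetical, and the key names (`page_id` on regions, `page` on highlights) are taken as-is from the example above, which is only a proposal.

```python
import json
from collections import defaultdict

# The proposed snippet structure from the example above.
snippet = json.loads("""
{
  "text": "Some text that crosses a <em>page boundary</em>",
  "score": 881062.75,
  "regions": [
    {"page_id": "page-1", "ulx": 194, "uly": 807, "lrx": 1196, "lry": 1008},
    {"page_id": "page-2", "ulx": 13, "uly": 64, "lrx": 1012, "lry": 128}
  ],
  "highlights": [
    [{"page": "page-1", "text": "page",
      "ulx": 694, "uly": 82, "lrx": 823, "lry": 111}],
    [{"page": "page-2", "text": "boundary",
      "ulx": 450, "uly": 162, "lrx": 563, "lry": 190}]
  ]
}
""")


def boxes_by_page(snippet):
    """Group a snippet's regions and highlight boxes by page identifier."""
    pages = defaultdict(lambda: {"regions": [], "highlights": []})
    for region in snippet["regions"]:
        pages[region["page_id"]]["regions"].append(region)
    # Each entry in "highlights" is a list of boxes for one highlighted span.
    for span in snippet["highlights"]:
        for box in span:
            pages[box["page"]]["highlights"].append(box)
    return dict(pages)


grouped = boxes_by_page(snippet)
for page_id, boxes in grouped.items():
    print(page_id, len(boxes["regions"]), len(boxes["highlights"]))
```

With a per-box page reference, a single highlighted span can be drawn correctly even when its boxes fall on different pages.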