Add support for snippets crossing page boundaries
jbaiter opened this issue · 0 comments
Currently we always end a snippet once it hits the end of a page block. This is a reasonable thing to do for a lot of OCR documents, since the first line of the new page will likely not be a continuation of the last line of the previous page, but rather a header line or a page number.
However, for users who know that the text in their documents flows continuously from one page to the next, we should keep open the possibility of generating multi-page snippets.
The way to go about this would be to simply allow limitIter = null (i.e. hl.ocr.limitBlock is the empty string) in the ContextBreakIterator class, i.e. don't terminate the passage generation early at all.
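To illustrate the idea, here is a minimal sketch in Python of passage expansion with an optional limit. It is a simplified model and not the actual ContextBreakIterator code: the function name `passage_bounds`, its parameters, and the representation of breaks as sorted character offsets are all assumptions made for this example.

```python
from bisect import bisect_left, bisect_right
from typing import Optional, Sequence, Tuple


def passage_bounds(
    match_start: int,
    match_end: int,
    line_breaks: Sequence[int],
    context_size: int,
    limit_breaks: Optional[Sequence[int]] = None,
) -> Tuple[int, int]:
    """Return (start, end) offsets of the passage around a match.

    line_breaks: sorted offsets where a context unit (e.g. a line) begins,
        including 0 and the total text length. (Hypothetical representation.)
    limit_breaks: sorted offsets where a limit block (e.g. a page) begins,
        or None to disable the limit, which mirrors limitIter = null.
    """
    # Line boundary at or before the match start, and at or after its end.
    first = bisect_right(line_breaks, match_start) - 1
    last = bisect_left(line_breaks, match_end)

    # Expand by context_size lines in both directions, staying in range.
    start = line_breaks[max(first - context_size, 0)]
    end = line_breaks[min(last + context_size, len(line_breaks) - 1)]

    if limit_breaks is not None:
        # Clamp the passage to the limit block(s) enclosing the match.
        lo = bisect_right(limit_breaks, match_start) - 1
        if lo >= 0:
            start = max(start, limit_breaks[lo])
        hi = bisect_left(limit_breaks, match_end)
        if hi < len(limit_breaks):
            end = min(end, limit_breaks[hi])
    return start, end


# Lines every 10 characters, pages every 30 characters.
lines = [0, 10, 20, 30, 40, 50, 60]
pages = [0, 30, 60]

limited = passage_bounds(25, 28, lines, 2, pages)  # (0, 30): stops at the page end
unlimited = passage_bounds(25, 28, lines, 2)       # (0, 50): runs into the next page
```

With the limit disabled, the passage simply takes as many context lines as requested, regardless of which page they belong to.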
This would require changing the API: a snippet could then consist of multiple regions, each with its own page identifier and coordinates. Highlights would likewise have to be extended with a reference to the page they're on:
{
  "text": "Some text that crosses a <em>page boundary</em>",
  "score": 881062.75,
  "regions": [
    {"page_id": "page-1", "ulx": 194, "uly": 807, "lrx": 1196, "lry": 1008},
    {"page_id": "page-2", "ulx": 13, "uly": 64, "lrx": 1012, "lry": 128}
  ],
  "highlights": [
    [{"page": "page-1", "text": "page",
      "ulx": 694, "uly": 82, "lrx": 823, "lry": 111}],
    [{"page": "page-2", "text": "boundary",
      "ulx": 450, "uly": 162, "lrx": 563, "lry": 190}]
  ]
}
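For a sense of what this would mean on the client side, here is a rough sketch in Python of how a consumer might group the regions and highlights of such a snippet by page, e.g. to draw each box on the right page image. The helper name `group_by_page` is hypothetical, and the code assumes the key names exactly as written in the example above (`page_id` on regions, `page` on highlights), which may well change in an actual implementation.

```python
import json
from collections import defaultdict

# The proposed response from the example above, trimmed to the relevant keys.
SNIPPET = json.loads("""
{
  "text": "Some text that crosses a <em>page boundary</em>",
  "regions": [
    {"page_id": "page-1", "ulx": 194, "uly": 807, "lrx": 1196, "lry": 1008},
    {"page_id": "page-2", "ulx": 13, "uly": 64, "lrx": 1012, "lry": 128}
  ],
  "highlights": [
    [{"page": "page-1", "text": "page",
      "ulx": 694, "uly": 82, "lrx": 823, "lry": 111}],
    [{"page": "page-2", "text": "boundary",
      "ulx": 450, "uly": 162, "lrx": 563, "lry": 190}]
  ]
}
""")


def group_by_page(snippet):
    """Map each page identifier to the regions and highlight boxes on it."""
    pages = defaultdict(lambda: {"regions": [], "highlights": []})
    for region in snippet["regions"]:
        pages[region["page_id"]]["regions"].append(region)
    # Each entry in "highlights" is a list of boxes; with multi-page
    # snippets, boxes in one list could in principle sit on different pages.
    for boxes in snippet["highlights"]:
        for box in boxes:
            pages[box["page"]]["highlights"].append(box)
    return dict(pages)


by_page = group_by_page(SNIPPET)
for page_id, content in by_page.items():
    words = [box["text"] for box in content["highlights"]]
    print(page_id, len(content["regions"]), "region(s), highlights:", words)
```

Clients that currently assume a single region per snippet would need a loop like this one, which is the main cost of the API change.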