dbmdz/solr-ocrhighlighting

Resolve hyphenation in indexing analysis chain

jbaiter opened this issue · 1 comments

Currently, hyphenation is not resolved when indexing OCR documents from disk. It would be good to have a way to resolve hyphenations that are designated in the source data during indexing.

A way to go about this would be:

  1. At indexing time, replace the two hyphenation parts with a single Word block that contains the dehyphenated form (while keeping the length the same!)
  2. At highlighting time, check if the highlighted span contains multiple word blocks. If so, use the dehyphenated forms for building the plaintext snippet and the hyphenated parts for calculating the regions and highlighting snippets.
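
Step 1 could be sketched roughly as follows. This is only an illustration in Python on plain text, not the plugin's code: the function name `merge_hyphenated` and the choice of padding with spaces are assumptions made for the example. The point it shows is that the replacement occupies exactly as many characters as the two original parts, so every offset after the merged word still points at the same place in the source file.

```python
def merge_hyphenated(text: str, start: int, end: int, dehyphenated: str) -> str:
    """Replace text[start:end] (both hyphenation parts and whatever lies
    between them) with the dehyphenated form, padded with spaces so that
    the overall length of the text does not change."""
    span_len = end - start
    if len(dehyphenated) > span_len:
        # Hypothetical guard: the merged word must fit into the original span.
        raise ValueError("dehyphenated form is longer than the original span")
    # ljust pads on the right, so the merged word starts where part 1 started.
    return text[:start] + dehyphenated.ljust(span_len) + text[end:]


# Usage: "hyphen-\nated" covers 12 characters, "hyphenated" only 10.
source = "a hyphen-\nated word"
start = source.index("hyphen-")
end = start + len("hyphen-\nated")
merged = merge_hyphenated(source, start, end, "hyphenated")
```

Because the length is preserved, the offsets of "word" are identical in `source` and `merged`, which is what highlighting needs in order to find the right region in the original file.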

This would have to be implemented for each format:

  • ALTO supports hyphenation with @SUBS_TYPE="HypPart1"/"HypPart2", @SUBS_CONTENT, and <HYP />
  • hOCR supports hyphenation by encoding it with &shy;
  • MiniOCR does not support hyphenation at the moment
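
To make the ALTO case concrete, here is a small sketch that reads a hand-written ALTO fragment and merges the two hyphenation parts into one word while keeping both regions. The fragment, the function name `resolve_words`, and the omission of the ALTO XML namespace are all simplifications for illustration; real files declare a namespace and the plugin does not work this way internally.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment, namespace omitted for brevity (an assumption).
ALTO_SNIPPET = """
<TextBlock>
  <TextLine>
    <String CONTENT="Hyphen" SUBS_TYPE="HypPart1" SUBS_CONTENT="Hyphenation"
            HPOS="10" VPOS="10" WIDTH="60" HEIGHT="12"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="ation" SUBS_TYPE="HypPart2" SUBS_CONTENT="Hyphenation"
            HPOS="0" VPOS="24" WIDTH="50" HEIGHT="12"/>
    <SP/>
    <String CONTENT="works" HPOS="60" VPOS="24" WIDTH="45" HEIGHT="12"/>
  </TextLine>
</TextBlock>
""".strip()


def resolve_words(xml_text):
    """Return a list of (text, boxes). A HypPart1/HypPart2 pair becomes one
    entry with the dehyphenated text and the boxes of both parts."""
    words = []
    open_part = None  # HypPart1 entry still waiting for its HypPart2
    for s in ET.fromstring(xml_text).iter("String"):
        box = tuple(int(s.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        kind = s.get("SUBS_TYPE")
        if kind == "HypPart2" and open_part is not None:
            open_part["boxes"].append(box)
            if not open_part["resolved"]:
                # No SUBS_CONTENT on part 1: fall back to concatenation.
                open_part["text"] += s.get("CONTENT")
            open_part = None
            continue
        resolved = s.get("SUBS_CONTENT") if kind == "HypPart1" else None
        word = {
            "text": resolved or s.get("CONTENT"),
            "boxes": [box],
            "resolved": resolved is not None,
        }
        words.append(word)
        open_part = word if kind == "HypPart1" else None
    return [(w["text"], w["boxes"]) for w in words]


words = resolve_words(ALTO_SNIPPET)
```

Here the first entry carries the text "Hyphenation" for the plaintext snippet and two boxes, one per line, for the highlighted regions.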

An open question is how this differs from, or improves on, Solr's own hyphenation filter.


After some thinking, this involves the following tasks:

  • ALTO: Resolve hyphenation in AltoCharFilterFactory so that dehyphenated tokens are indexed (see 279f3be)
  • ALTO: Add code to AltoPassageFormatter#getTextFromXml to resolve the hyphenations (see fb5c201)
  • hOCR: Add a HocrCharFilterFactory that strips out the &shy; characters so downstream tokenization does not split the hyphenated tokens. (see f770248)

At passage generation time, nothing needs to be done for hOCR, since text renders won't display the soft hyphen anyway.
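
The hOCR idea can be illustrated with a short sketch that removes soft hyphens and remembers where each remaining character came from. This is plain Python, not the HocrCharFilterFactory itself; the name `strip_soft_hyphens`, the set of spellings matched, and the explicit offset list are assumptions for the example.

```python
import re

# Soft hyphen as a named entity, a decimal character reference, or U+00AD.
SOFT_HYPHEN = re.compile("&shy;|&#173;|\u00ad")


def strip_soft_hyphens(text):
    """Return (stripped, offsets), where offsets[i] is the index in `text`
    of the i-th character of `stripped`."""
    pieces, offsets, pos = [], [], 0
    for match in SOFT_HYPHEN.finditer(text):
        pieces.append(text[pos:match.start()])
        offsets.extend(range(pos, match.start()))
        pos = match.end()  # skip the soft hyphen itself
    pieces.append(text[pos:])
    offsets.extend(range(pos, len(text)))
    return "".join(pieces), offsets


stripped, offsets = strip_soft_hyphens("Hyphen&shy;ation works")
```

After stripping, a whitespace tokenizer sees "Hyphenation" as one token, and `offsets` still maps each character back to its position in the original markup for highlighting.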

Implemented in #45