dbmdz/solr-ocrhighlighting

ALTO whitespace handling is inconsistent

jbaiter opened this issue · 1 comment

When an ALTO file does not explicitly mark whitespace with <SP> elements, the top-level "text" of a snippet comes back without whitespace between words, while the "text" of each individual region includes it:

  {
    "text": "DieZahlderer,welchejeneSchreckens: zeitmitAugenſahen,inwelcher<em>Zittau</em>, <em>im</em>GefolgedesſiebenjährigenKrieges,den 23.Juli1757,aufdieſchre>li<ſteArt zerſtörtward,kannzwarnurnochklein",
    "score": 662.4285,
    "pages": [
      {
        "id": "p00000001",
        "width": 1269,
        "height": 1947
      }
    ],
    "regions": [
      {
        "ulx": 141,
        "uly": 720,
        "lrx": 989,
        "lry": 984,
        "text": "Die Zahl derer, welche jene Schreckens: zeit mit Augen ſahen, in welcher <em>Zittau</em>, <em>im</em> Gefolge des ſiebenjährigen Krieges, den 23. Juli 1757, auf die ſchre>li<ſte Art zerſtört ward, kann zwar nur noch klein",
        "pageIdx": 0
      }
    ],
    // ...
  }
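
A quick way to check a response for this mismatch is to compare the snippet-level text with the region texts once the `<em>` highlighting tags are removed. The following is only a sketch: the helper names `strip_markup` and `whitespace_mismatch` are made up for illustration, the sample is a shortened version of the response above, and it assumes each snippet is a dict with a `text` key and a `regions` list as shown.

```python
import re

# Matches the <em> / </em> highlighting tags seen in the response above.
EM_TAG = re.compile(r"</?em>")


def strip_markup(text):
    """Remove highlighting tags so only the OCR text remains."""
    return EM_TAG.sub("", text)


def whitespace_mismatch(snippet):
    """Return True if snippet text and region text differ only in whitespace.

    Hypothetical helper: assumes `snippet` has a "text" key and a
    "regions" list whose items each have a "text" key.
    """
    top = strip_markup(snippet["text"])
    regions = " ".join(strip_markup(r["text"]) for r in snippet["regions"])
    # Same characters once all whitespace is dropped ...
    same_chars = "".join(top.split()) == "".join(regions.split())
    # ... but a different split into whitespace-separated tokens.
    same_tokens = top.split() == regions.split()
    return same_chars and not same_tokens


# Shortened sample modelled on the response in this issue.
broken = {
    "text": "DieZahlderer,welchejene<em>Zittau</em>, <em>im</em>Gefolge",
    "regions": [
        {"text": "Die Zahl derer, welche jene <em>Zittau</em>, <em>im</em> Gefolge"}
    ],
}
consistent = {
    "text": "Die Zahl derer, welche jene",
    "regions": [{"text": "Die Zahl derer, welche jene"}],
}

print(whitespace_mismatch(broken))
print(whitespace_mismatch(consistent))
```

The check only flags cases where the characters match and the spacing does not, so differences in the actual OCR content are not reported as whitespace problems.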

Thanks to @ulb-sa-schmilj for reporting!

Couldn't reproduce this with the most recent version; closing until further notice.
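
For anyone who wants to retest later, a minimal fixture without `<SP>` elements might look like the sketch below. The coordinates are invented, the namespace is omitted for brevity (real ALTO files declare one, which `local_name` strips), and joining words with a single space is an assumption made for illustration, not a description of what the plugin does.

```python
import xml.etree.ElementTree as ET

# Minimal ALTO-like fixture: String elements only, no <SP/> between them.
# Word coordinates are made up; page ID and size are taken from the issue.
ALTO_NO_SP = """<alto>
  <Layout>
    <Page ID="p00000001" WIDTH="1269" HEIGHT="1947">
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Die" HPOS="141" VPOS="720" WIDTH="40" HEIGHT="30"/>
            <String CONTENT="Zahl" HPOS="190" VPOS="720" WIDTH="55" HEIGHT="30"/>
            <String CONTENT="derer," HPOS="255" VPOS="720" WIDTH="70" HEIGHT="30"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop an XML namespace prefix such as '{uri}String' -> 'String'."""
    return tag.rsplit("}", 1)[-1]


def has_explicit_spaces(root):
    """True if the document contains at least one <SP> element."""
    return any(local_name(el.tag) == "SP" for el in root.iter())


def line_text(line):
    """Join the words of one TextLine with single spaces (an assumption)."""
    words = [el.get("CONTENT", "") for el in line if local_name(el.tag) == "String"]
    return " ".join(words)


root = ET.fromstring(ALTO_NO_SP)
lines = [el for el in root.iter() if local_name(el.tag) == "TextLine"]

print(has_explicit_spaces(root))
print([line_text(line) for line in lines])
```

Indexing a file like this and comparing the snippet text with the region text should show whether the inconsistency still occurs.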