dbmdz/solr-ocrhighlighting

Box merging leads to mismatch between text and image highlights

jbaiter opened this issue · 0 comments

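When a query for individual tokens matches two adjacent tokens, their bounding boxes are merged into a single highlight box, but the corresponding text highlights remain separate `<em>` spans. In the response below, "Die" and "Zahl" share one box with the text "Die Zahl", while the snippet text still marks them up as `<em>Die</em><em>Zahl</em>`, so the snippet contains three `<em>` spans but only two highlight boxes.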

  {
    "text": "<em>Die</em><em>Zahl</em>derer,welchejeneSchreckens: zeitmitAugenſahen,inwelcherZittau, imGefolgedesſiebenjährigenKrieges,den 23.Juli1757,auf<em>die</em>ſchre>li<ſteArt zerſtörtward,kannzwarnurnochklein ſeyn,jedochiſtgewißjedembiedernZit-",
    // ...
    "highlights": [
      [
        {
          "ulx": 142,
          "uly": 720,
          "lrx": 348,
          "lry": 792,
          "text": "Die Zahl",
          "parentRegionIdx": 0
        }
      ],
      [
        {
          "ulx": 585,
          "uly": 892,
          "lrx": 637,
          "lry": 929,
          "text": "die",
          "parentRegionIdx": 0
        }
      ]
    ]
  }
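
To illustrate the mismatch, here is a minimal client-side sketch in Python that collapses directly adjacent `<em>` spans in the snippet text, so that the number of text highlights lines up with the number of merged boxes. This is not the plugin's code and not necessarily how a fix should look; the helper names `merge_adjacent_highlights` and `count_highlights` are made up for this example, and it assumes the default `<em>`/`</em>` markers and that "adjacent" means separated by nothing or by whitespace only.

```python
import re

# A closing </em> followed by an opening <em>, with only optional whitespace
# in between. Assumption: the default <em> markers are in use.
ADJACENT_EM = re.compile(r"</em>(\s*)<em>")


def merge_adjacent_highlights(text: str) -> str:
    """Collapse directly adjacent <em> spans into one span (hypothetical helper)."""
    # Keep whatever whitespace separated the two spans, drop the inner tags.
    return ADJACENT_EM.sub(r"\1", text)


def count_highlights(text: str) -> int:
    """Count the <em> spans in a snippet (hypothetical helper)."""
    return len(re.findall(r"<em>", text))


# Shortened version of the snippet text from the response above.
snippet = "<em>Die</em><em>Zahl</em>derer,welche ... auf<em>die</em>"
merged = merge_adjacent_highlights(snippet)

print(count_highlights(snippet))  # spans before merging
print(count_highlights(merged))   # spans after merging
```

With the shortened snippet, the span count drops from three to two, which matches the two entries in `highlights`. A server-side fix would more likely merge the spans at the point where the boxes are merged, rather than post-processing the string.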

Thanks to @ulb-sa-schmilj for reporting!