Context regions include whitespace-only lines that should have been removed by the indexing pipeline

Question

mbennett-uoe opened this issue 4 years ago · 6 comments

I am running a search against an ALTO file that contains a large number of lines with nothing but whitespace (an unfortunate artifact of the OCR process).

For example:

<TextBlock ID="block_6" HPOS="452" VPOS="5849" WIDTH="13" HEIGHT="658">
  <TextLine ID="line_49" HPOS="452" VPOS="5849" WIDTH="13" HEIGHT="658">
      <String ID="string_74" HPOS="452" VPOS="5849" WIDTH="13" HEIGHT="658" WC="0.95" CONTENT=" "/>
  </TextLine>
</TextBlock>
<TextBlock ID="block_7" HPOS="452" VPOS="5849" WIDTH="13" HEIGHT="658">
  <TextLine ID="line_50" HPOS="452" VPOS="5849" WIDTH="13" HEIGHT="658">
      <String ID="string_75" HPOS="452" VPOS="5849" WIDTH="13" HEIGHT="658" WC="0.95" CONTENT=" "/>
  </TextLine>
</TextBlock>
<TextBlock ID="block_8" HPOS="2887" VPOS="518" WIDTH="414" HEIGHT="189">
  <TextLine ID="line_51" HPOS="2887" VPOS="518" WIDTH="414" HEIGHT="189">
      <String ID="string_76" HPOS="2887" VPOS="593" WIDTH="93" HEIGHT="114" WC="0.44" CONTENT="if"/><SP WIDTH="284" VPOS="593" HPOS="2980"/>
      <String ID="string_77" HPOS="3264" VPOS="590" WIDTH="37" HEIGHT="99" WC="0.44" CONTENT="36%"/>
  </TextLine>
</TextBlock>
<TextBlock ID="block_9" HPOS="1547" VPOS="760" WIDTH="3110" HEIGHT="276">
  <TextLine ID="line_52" HPOS="1549" VPOS="760" WIDTH="3070" HEIGHT="172">
      <String ID="string_78" HPOS="1549" VPOS="760" WIDTH="168" HEIGHT="172" WC="0.85" CONTENT="6d:"/><SP WIDTH="72" VPOS="760" HPOS="1717"/>
      <String ID="string_79" HPOS="1789" VPOS="816" WIDTH="285" HEIGHT="82" WC="0.96" CONTENT="which"/><SP WIDTH="68" VPOS="816" HPOS="2074"/>
      <String ID="string_80" HPOS="2142" VPOS="772" WIDTH="131" HEIGHT="124" WC="0.96" CONTENT="he"/><SP WIDTH="37" VPOS="772" HPOS="2273"/>
      <String ID="string_81" HPOS="2310" VPOS="816" WIDTH="387" HEIGHT="99" WC="0.96" CONTENT="received"/><SP WIDTH="53" VPOS="816" HPOS="2697"/>
      <String ID="string_82" HPOS="2750" VPOS="813" WIDTH="267" HEIGHT="87" WC="0.92" CONTENT="from."/><SP WIDTH="35" VPOS="813" HPOS="3017"/>
      <String ID="string_83" HPOS="3052" VPOS="801" WIDTH="289" HEIGHT="88" WC="0.83" CONTENT="Meflrs"/><SP WIDTH="48" VPOS="801" HPOS="3341"/>
      <String ID="string_84" HPOS="3389" VPOS="797" WIDTH="293" HEIGHT="102" WC="0.96" CONTENT="Seton,"/><SP WIDTH="57" VPOS="797" HPOS="3682"/>
      <String ID="string_85" HPOS="3739" VPOS="803" WIDTH="379" HEIGHT="106" WC="0.96" CONTENT="Wallace"/><SP WIDTH="46" VPOS="803" HPOS="4118"/>
      <String ID="string_86" HPOS="4164" VPOS="800" WIDTH="161" HEIGHT="81" WC="0.92" CONTENT="and"/><SP WIDTH="40" VPOS="800" HPOS="4325"/>
      <String ID="string_87" HPOS="4365" VPOS="799" WIDTH="254" HEIGHT="129" WC="0.86" CONTENT="Com-"/>
  </TextLine>
  <TextLine ID="line_53" HPOS="1547" VPOS="917" WIDTH="3110" HEIGHT="119">
      <String ID="string_88" HPOS="1547" VPOS="960" WIDTH="254" HEIGHT="76" WC="0.96" CONTENT="pany,"/><SP WIDTH="72" VPOS="960" HPOS="1801"/>
      <String ID="string_89" HPOS="1873" VPOS="931" WIDTH="163" HEIGHT="78" WC="0.95" CONTENT="and"/><SP WIDTH="62" VPOS="931" HPOS="2036"/>
      <String ID="string_90" HPOS="2098" VPOS="927" WIDTH="227" HEIGHT="80" WC="0.95" CONTENT="from"/><SP WIDTH="61" VPOS="927" HPOS="2325"/>
      <String ID="string_91" HPOS="2386" VPOS="930" WIDTH="126" HEIGHT="93" WC="0.94" CONTENT="Sir"/><SP WIDTH="67" VPOS="930" HPOS="2512"/>
      <String ID="string_92" HPOS="2579" VPOS="926" WIDTH="379" HEIGHT="80" WC="0.94" CONTENT="William"/><SP WIDTH="24" VPOS="926" HPOS="2958"/>
      <String ID="string_93" HPOS="2982" VPOS="917" WIDTH="324" HEIGHT="123" WC="0.94" CONTENT="Forbes"/><SP WIDTH="77" VPOS="917" HPOS="3306"/>
      <String ID="string_94" HPOS="3383" VPOS="917" WIDTH="163" HEIGHT="77" WC="0.96" CONTENT="and"/><SP WIDTH="71" VPOS="917" HPOS="3546"/>
      <String ID="string_95" HPOS="3617" VPOS="918" WIDTH="459" HEIGHT="99" WC="0.95" CONTENT="Company,"/><SP WIDTH="51" VPOS="918" HPOS="4076"/>
      <String ID="string_96" HPOS="4127" VPOS="930" WIDTH="116" HEIGHT="62" WC="0.78" CONTENT="on&#8217;"/><SP WIDTH="27" VPOS="930" HPOS="4243"/>
      <String ID="string_97" HPOS="4270" VPOS="926" WIDTH="345" HEIGHT="101" WC="0.71" CONTENT="giving"/><SP WIDTH="33" VPOS="926" HPOS="4615"/>
      <String ID="string_98" HPOS="4648" VPOS="918" WIDTH="9" HEIGHT="49" WC="0.47" CONTENT="|"/>
  </TextLine>
</TextBlock>

Searching for "Forbes" in this document with the following parameters:

hl=on&hl.ocr.absoluteHighlights=true&df=ocr_text&hl.ocr.fl=ocr_text&hl.ocr.limitBlock=page&hl.ocr.contextBlock=line&hl.ocr.contextSize=3

returns the following (for the example section above):

{
  "text":"if 36% 6d: which he received from. Meflrs Seton, Wallace and Com- pany, and from Sir William <em>Forbes</em> and Company, on’ giving | them indorfations to the accommodation-bills, to which amount thefe gentlemen are accordingly ranked on Forrefter’s eftate; but the truftee for Laidlaw’s creditors further contends, and the",
  "score":112779.664,
  "pages":[{
      "id":"page_6",
      "width":5835,
      "height":6853}],
  "regions":[{
      "ulx":452,
      "uly":5849,
      "lrx":465,
      "lry":6507,
      "text":" ",
      "pageIdx":0},
    {
      "ulx":1546,
      "uly":590,
      "lrx":4657,
      "lry":1362,
      "text":"if 36% 6d: which he received from. Meflrs Seton, Wallace and Com- pany, and from Sir William <em>Forbes</em> and Company, on’ giving | them indorfations to the accommodation-bills, to which amount thefe gentlemen are accordingly ranked on Forrefter’s eftate; but the truftee for Laidlaw’s creditors further contends, and the",
      "pageIdx":0}],
  "highlights":[[{
        "ulx":2982,
        "uly":917,
        "lrx":3306,
        "lry":1040,
        "text":"Forbes",
        "parentRegionIdx":1}]]},

I would expect the blank region to be ignored, since Solr's indexing should have removed those whitespace characters. I assume it is included because the code that generates the context regions works solely from the input document, and not from what is in the Solr index?

Answer 1 · 2020-05-05T12:40:00.000Z

Here is a file with which you can replicate this behaviour: CSP-15-test.xml.txt

I understand that this is likely to be an edge case, and am very happy to solve it by either cleaning up the XML files or adding some logic to my code to ignore regions containing only whitespace, but I'd be interested to know if there's a more elegant way to achieve this inside the actual plugin?
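For anyone who wants to go the "clean up the XML files" route, here is a rough sketch of what that could look like with the JDK's built-in DOM parser. It is only an illustration: the class and method names (`AltoCleaner`, `removeBlankTextBlocks`) are made up for this example, it assumes the ALTO elements are unprefixed (`TextBlock`, `String`) as in the excerpt above, and it uses `trim()`, so it only catches ASCII whitespace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the plugin.
final class AltoCleaner {

    /** Parses an XML string; wraps checked exceptions to keep callers simple. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /**
     * Removes every TextBlock whose String descendants all have a blank
     * CONTENT attribute (or that has no String descendants at all).
     * Returns the number of blocks removed.
     */
    static int removeBlankTextBlocks(Document doc) {
        NodeList blocks = doc.getElementsByTagName("TextBlock");
        // Collect first: the NodeList is live and changes as nodes are removed.
        List<Element> toRemove = new ArrayList<>();
        for (int i = 0; i < blocks.getLength(); i++) {
            Element block = (Element) blocks.item(i);
            if (hasOnlyBlankStrings(block)) {
                toRemove.add(block);
            }
        }
        for (Element block : toRemove) {
            block.getParentNode().removeChild(block);
        }
        return toRemove.size();
    }

    private static boolean hasOnlyBlankStrings(Element block) {
        NodeList strings = block.getElementsByTagName("String");
        for (int i = 0; i < strings.getLength(); i++) {
            String content = ((Element) strings.item(i)).getAttribute("CONTENT");
            if (!content.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }
}
```

Applied to the excerpt above, this would drop block_6 and block_7 and keep block_8 and block_9. Writing the cleaned document back to disk (for example with `javax.xml.transform.Transformer`) is left out to keep the sketch short.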

Answer 2 · 2020-05-11T09:30:44.000Z

Thanks for reporting this and providing the sample data! :-)

This is indeed an edge case, and your intuition is correct: the plugin only takes the positions of the matching terms from Solr and determines everything else from the OCR file, so the output can contain text that was ignored by Solr (for good reason).

For this edge case, I think I'll just try to filter out regions that only contain whitespace (or no text at all), as you suggested. I'm currently preparing a new release, and I think I should manage to implement it before that :-)
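To make the idea concrete, here is a minimal sketch of such a filter. It is not the plugin's actual code: `Region`, `RegionFilter` and `dropBlankRegions` are hypothetical names that mirror the JSON response above. One detail worth noting is that each highlight carries a `parentRegionIdx`, so dropping a region shifts the indices of the regions after it; the sketch therefore also returns an old-to-new index map.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one entry of the "regions" array; not a plugin class. */
final class Region {
    final int ulx, uly, lrx, lry;
    final String text;

    Region(int ulx, int uly, int lrx, int lry, String text) {
        this.ulx = ulx;
        this.uly = uly;
        this.lrx = lrx;
        this.lry = lry;
        this.text = text;
    }
}

final class RegionFilter {

    /** Kept regions plus a map from old region index to new index (-1 if dropped). */
    static final class Result {
        final List<Region> regions = new ArrayList<>();
        final int[] indexMap;

        Result(int size) {
            this.indexMap = new int[size];
        }
    }

    /** Drops regions whose text is null, empty, or ASCII whitespace only. */
    static Result dropBlankRegions(List<Region> regions) {
        Result result = new Result(regions.size());
        for (int i = 0; i < regions.size(); i++) {
            Region region = regions.get(i);
            boolean blank = region.text == null || region.text.trim().isEmpty();
            if (blank) {
                result.indexMap[i] = -1;
            } else {
                // The new index is the position the region gets in the kept list.
                result.indexMap[i] = result.regions.size();
                result.regions.add(region);
            }
        }
        return result;
    }
}
```

With the response above, the blank region at index 0 would be dropped, and the "Forbes" highlight's `parentRegionIdx` of 1 would map to 0.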

Answer 3 · 2020-05-11T10:56:02.000Z

Amazing, thanks for the quick response and solution. I'd ended up temporarily fixing it with a nasty hack in OcrBox to return a blank SimpleOrderedMap if the text was empty, so this is much nicer!

Answer 4 · 2020-05-11T10:59:26.000Z

One quick question. Is it worth using .strip() instead of .trim() so that Unicode space characters are handled correctly?

Answer 5 · 2020-05-11T11:19:41.000Z

You're right, it would, but .strip() is only available from Java 11 on, and I'd like to keep the plugin compatible with Java 8, which is currently Solr's minimum supported version.
I'll check if there's an equivalent in Guava or Apache Commons. That will have to wait for the next release, though, sorry :-/
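For reference, a Java 8 compatible check is not hard to write by hand. The sketch below is an illustration, not what the plugin ships: `BlankCheck.isBlank` is a hypothetical helper that walks the string by code point and uses `Character.isWhitespace`, which is the same definition of whitespace that Java 11's `strip()` uses. `trim()`, by contrast, only removes characters up to U+0020. As far as I know, `StringUtils.isBlank` in Apache Commons Lang applies the same `Character.isWhitespace` test.

```java
// Hypothetical helper, not part of the plugin.
final class BlankCheck {

    /**
     * Returns true if the string is null, empty, or consists only of
     * characters for which Character.isWhitespace is true.
     * Works on Java 8, since it avoids String.strip() and String.isBlank().
     */
    static boolean isBlank(String s) {
        if (s == null) {
            return true;
        }
        int i = 0;
        while (i < s.length()) {
            int codePoint = s.codePointAt(i);
            if (!Character.isWhitespace(codePoint)) {
                return false;
            }
            // Advance by 1 or 2 chars, depending on whether this is a surrogate pair.
            i += Character.charCount(codePoint);
        }
        return true;
    }
}
```

One caveat: `Character.isWhitespace` deliberately excludes the non-breaking spaces (U+00A0, U+2007, U+202F), so a region containing only a no-break space would still count as non-blank with this check.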

Answer 6 · 2020-05-11T11:27:40.000Z

Gotcha :) I have to confess to a lot of ignorance about Java, as it's not a language I program in regularly. I only discovered that strip and trim both exist, and that they differ, while doing a quick search for the name of Java's whitespace-trimming function 😀

Outside of that, I'm happy to confirm that I just compiled v0.4.0 from the latest commit (just before you pushed the .JAR release to GH) and this fix is working perfectly. Thanks again :)