dbmdz/solr-ocrhighlighting

Compiled 0.6.0 snapshot gives an error on a specific word

Closed this issue · 5 comments

@jbaiter I found a strange issue using a 0.6.0 snapshot compiled from main.
It only occurs when I search for the word "documenti" on this page, which was indexed with:
["<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<ocr><p xml:id=\"sequence_2\" wh=\"2480 3508\"><b><l><w x=\".125 .069 .053 .01\">Senato</w> <w x=\".189 .069 .039 .01\">della</w> <w x=\".238 .069 .09 .012\">Repubblica</w></l><l><w x=\".707 .069 .061 .01\">Camera</w> <w x=\".779 .069 .023 .01\">dei</w> <w x=\".813 .069 .064 .012\">deputati</w></l><l><w x=\".477 .069 .049 .009\">–2–</w></l><l><w x=\".267 .097 .032 .006\">XVIII</w> <w x=\".307 .097 .092 .007\">LEGISLATURA</w> <w x=\".406 .101 .007 \">–</w> <w x=\".42 .097 .052 .007\">DISEGNI</w> <w x=\".48 .097 .013 .006\">DI</w> <w x=\".5 .097 .044 .007\">LEGGE</w> <w x=\".551 .097 .008 .006\">E</w> <w x=\".566 .097 .071 .007\">RELAZIONI</w> <w x=\".644 .101 .004 .001\">-</w> <w x=\".655 .097 .079 .007\">DOCUMENTI</w></l></b></p></ocr>"]

Solr log gives this error:

...
2021-04-26 16:27:45.845 ERROR (qtp1997859171-540) [   x:archipelago] d.d.s.l.OcrPassageFormatter Could not create snippet (start=16179, end=16751) from content at '<?xml version="1.0" encoding=...' due to an out-of-bounds error.

Does the file on disk correspond to the document that was used during indexing?
java.lang.ArrayIndexOutOfBoundsException: 3
        at de.digitalcollections.solrocr.formats.miniocr.MiniOcrParser.readNext(MiniOcrParser.java:48) ~[?:?]
...

If I switch back to 0.5.0 without re-indexing, it works fine.
All other searches work fine with the 0.6.0 snapshot.
This happens on Solr 8.7.0 and 8.8.1.
Tell me if you need more info.
Anyway, thanks for your great plugin!!

This is a strange one. The offsets in the error message are out of bounds for the input document (which has length 746, i.e. less than half of the starting offset).
Are you sure the error is occurring because of the document in your description? Does it also happen if this document is the only doc in the index?
Also, can you please post the schema for the field you're using for the OCR?
What happens if you query for other terms/phrases that appear in the document?

@jbaiter Thanks, I ran a complete series of checks, as follows:

  1. Installed plugin 0.5.0 and indexed a one-page document, doc_A:
"tcocr_highlightm_X3b_und_ocr_text":["<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<ocr><p xml:id=\"sequence_1\" wh=\"2480 3508\"><b><l><w x=\".125 .069 .053 .01\">Senato</w> <w x=\".189 .069 .039 .01\">della</w> <w x=\".238 .069 .09 .012\">Repubblica</w></l><l><w x=\".707 .069 .061 .01\">Camera</w> <w x=\".779 .069 .023 .01\">dei</w> <w x=\".813 .069 .064 .012\">deputati</w></l><l><w x=\".477 .069 .049 .009\">–2–</w></l><l><w x=\".267 .097 .032 .006\">XVIII</w> <w x=\".307 .097 .092 .007\">LEGISLATURA</w> <w x=\".406 .101 .007 \">–</w> <w x=\".42 .097 .052 .007\">DISEGNI</w> <w x=\".48 .097 .013 .006\">DI</w> <w x=\".5 .097 .044 .007\">LEGGE</w> <w x=\".551 .097 .008 .006\">E</w> <w x=\".566 .097 .071 .007\">RELAZIONI</w> <w x=\".644 .101 .004 .001\">-</w> <w x=\".655 .097 .079 .007\">DOCUMENTI</w></l></b></p></ocr>"],
  2. Queried doc_A for "documenti", "repubblica" and "legge": all OK
2021-04-26 18:44:11.669 INFO  (qtp1997859171-566) [   x:archipelago] o.a.s.c.S.Request [archipelago]  webapp=/solr path=/select params={json.nl=flat&hl=true&TZ=UTC&fl=*,score&hl.requireFieldMatch=false&start=0&hl.fragsize=0&sort=score+desc&fq=ss_parent_id:"146"&fq=ss_search_api_datasource:"strawberryfield_flavor_datasource"&fq=ss_processor_id:"ocr"&fq=%2Bindex_id:default_solr_index&fq=ss_search_api_language:("en"+"und")&rows=20&hl.simple.pre=[HIGHLIGHT]&hl.snippets=3&q={!boost+b%3Dboost_document}++(tcocr_highlightm_X3b_en_ocr_text:(%2B"documenti")^1+tcocr_highlightm_X3b_und_ocr_text:(%2B"documenti")^1)&hl.mergeContiguous=false&hl.ocr.absoluteHighlights=on&hl.simple.post=[/HIGHLIGHT]&omitHeader=true&hl.method=UnifiedHighlighter&hl.ocr.fl=tcocr_highlightm_X3b_und_ocr_text&wt=json} hits=1 status=0 QTime=11
  3. Switched the plugin to the 0.6.0 snapshot
  4. Queried doc_A for "documenti" and "legge" (NOTE: those words are in capitals in the document): both give an error
2021-04-26 18:50:19.879 ERROR (qtp1997859171-515) [   x:archipelago] d.d.s.l.OcrPassageFormatter Could not create snippet (start=205, end=746) from content at '<?xml version="1.0" encoding=...' due to an out-of-bounds error.
Does the file on disk correspond to the document that was used during indexing?
java.lang.ArrayIndexOutOfBoundsException: 3
        at de.digitalcollections.solrocr.formats.miniocr.MiniOcrParser.readNext(MiniOcrParser.java:48) ~[?:?]
        at de.digitalcollections.solrocr.formats.OcrParser.next(OcrParser.java:127) ~[?:?]
        at de.digitalcollections.solrocr.formats.OcrParser.next(OcrParser.java:25) ~[?:?]
        at de.digitalcollections.solrocr.lucene.OcrPassageFormatter.parseWords(OcrPassageFormatter.java:327) ~[?:?]
        at de.digitalcollections.solrocr.lucene.OcrPassageFormatter.parseFragment(OcrPassageFormatter.java:208) ~[?:?]
        at de.digitalcollections.solrocr.lucene.OcrPassageFormatter.format(OcrPassageFormatter.java:185) ~[?:?]
        at de.digitalcollections.solrocr.lucene.OcrPassageFormatter.format(OcrPassageFormatter.java:97) ~[?:?]
        at de.digitalcollections.solrocr.lucene.OcrFieldHighlighter.highlightFieldForDoc(OcrFieldHighlighter.java:64) ~[?:?]
        at de.digitalcollections.solrocr.lucene.OcrHighlighter.highlightOcrFields(OcrHighlighter.java:304) ~[?:?]
        at de.digitalcollections.solrocr.solr.SolrOcrHighlighter.doHighlighting(SolrOcrHighlighter.java:48) ~[?:?]
        at de.digitalcollections.solrocr.solr.OcrHighlightComponent.process(OcrHighlightComponent.java:76) ~[?:?]
  5. Queried doc_A for "repubblica": OK
2021-04-26 18:52:35.112 INFO  (qtp1997859171-556) [   x:archipelago] o.a.s.c.S.Request [archipelago]  webapp=/solr path=/select params={json.nl=flat&hl=true&TZ=UTC&fl=*,score&hl.requireFieldMatch=false&start=0&hl.fragsize=0&sort=score+desc&fq=ss_parent_id:"146"&fq=ss_search_api_datasource:"strawberryfield_flavor_datasource"&fq=ss_processor_id:"ocr"&fq=%2Bindex_id:default_solr_index&fq=ss_search_api_language:("en"+"und")&rows=20&hl.simple.pre=[HIGHLIGHT]&hl.snippets=3&q={!boost+b%3Dboost_document}++(tcocr_highlightm_X3b_en_ocr_text:(%2B"repubblica")^1+tcocr_highlightm_X3b_und_ocr_text:(%2B"repubblica")^1)&hl.mergeContiguous=false&hl.ocr.absoluteHighlights=on&hl.simple.post=[/HIGHLIGHT]&omitHeader=true&hl.method=UnifiedHighlighter&hl.ocr.fl=tcocr_highlightm_X3b_und_ocr_text&wt=json} hits=1 status=0 QTime=58
  6. Ingested a one-page document, doc_B:
"tcocr_highlightm_X3b_und_ocr_text":["<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<ocr><p xml:id=\"sequence_1\" wh=\"2480 3508\"><b><l><w x=\".125 .069 .053 .01\">Senato</w> <w x=\".189 .069 .039 .01\">della</w> <w x=\".238 .069 .09 .012\">Repubblica</w></l><l><w x=\".707 .069 .061 .01\">Camera</w> <w x=\".779 .069 .023 .01\">dei</w> <w x=\".813 .069 .064 .012\">deputati</w></l><l><w x=\".477 .069 .049 .009\">–2–</w></l><l><w x=\".267 .097 .032 .006\">XVIII</w> <w x=\".307 .097 .092 .007\">LEGISLATURA</w> <w x=\".406 .101 .007 \">–</w> <w x=\".42 .097 .052 .007\">DISEGNI</w> <w x=\".48 .097 .013 .006\">DI</w> <w x=\".5 .097 .044 .007\">LEGGE</w> <w x=\".551 .097 .008 .006\">E</w> <w x=\".566 .097 .071 .007\">RELAZIONI</w> <w x=\".644 .101 .004 .001\">-</w> <w x=\".655 .097 .079 .007\">DOCUMENTI</w></l></b></p></ocr>"],
  7. Queried doc_B for "documenti" and "legge": both give an error
2021-04-26 18:59:24.182 ERROR (qtp1997859171-569) [   x:archipelago] d.d.s.l.OcrPassageFormatter Could not create snippet (start=205, end=746) from content at '<?xml version="1.0" encoding=...' due to an out-of-bounds error.
Does the file on disk correspond to the document that was used during indexing?
java.lang.ArrayIndexOutOfBoundsException: 3
  8. Queried doc_B for "repubblica": OK
2021-04-26 19:00:34.087 INFO  (qtp1997859171-569) [   x:archipelago] o.a.s.c.S.Request [archipelago]  webapp=/solr path=/select params={json.nl=flat&hl=true&TZ=UTC&fl=*,score&hl.requireFieldMatch=false&start=0&hl.fragsize=0&sort=score+desc&fq=ss_parent_id:"147"&fq=ss_search_api_datasource:"strawberryfield_flavor_datasource"&fq=ss_processor_id:"ocr"&fq=%2Bindex_id:default_solr_index&fq=ss_search_api_language:("en"+"und")&rows=20&hl.simple.pre=[HIGHLIGHT]&hl.snippets=3&q={!boost+b%3Dboost_document}++(tcocr_highlightm_X3b_en_ocr_text:(%2B"repubblica")^1+tcocr_highlightm_X3b_und_ocr_text:(%2B"repubblica")^1)&hl.mergeContiguous=false&hl.ocr.absoluteHighlights=on&hl.simple.post=[/HIGHLIGHT]&omitHeader=true&hl.method=UnifiedHighlighter&hl.ocr.fl=tcocr_highlightm_X3b_und_ocr_text&wt=json} hits=1 status=0 QTime=10
  9. Switched back to 0.5.0: all queries (doc_A and doc_B) OK

My field schema:

<fieldType name="text_ocr_stored" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
  <analyzer type="index">
    <charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="accents_und.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ElisionFilterFactory" articles="lang/contractions_it.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_it.txt"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords_und.txt" />
    <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
    <filter class="solr.LengthFilterFactory" min="3" max="100"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="accents_und.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ElisionFilterFactory" articles="lang/contractions_it.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_it.txt"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords_und.txt" />
    <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
    <filter class="solr.LengthFilterFactory" min="3" max="100"/>
  </analyzer>
</fieldType>

Final considerations:
A) Is this related to capital letters and the field schema, perhaps?
B) Why does it work with 0.5.0 and not with 0.6.0?

Thanks a lot in advance for checking this whenever you can. I hope it is something related to my config and not a plugin bug.

Thanks a lot for the in-depth testing @giancarlobi, the full stack trace was most helpful in locating the root of the problem:

<w x=".406 .101 .007 ">–</w>

You have a word element with no height (its x attribute holds only three values), which is why MiniOcrParser#readNext throws an error when parsing the coordinates: it tries to access the missing fourth element.
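
Words like this can also be caught on the indexing side, before the document reaches Solr. The following is a minimal sketch in Python, not code from the plugin. It assumes the MiniOCR markup is available as a plain (unescaped) string and that every `<w>` element should carry four numeric values in its `x` attribute. The function name `find_malformed_words` is hypothetical.

```python
import xml.etree.ElementTree as ET


def find_malformed_words(miniocr_xml):
    """Return (text, raw x attribute) for each <w> element whose x attribute
    does not consist of exactly four numeric values."""
    root = ET.fromstring(miniocr_xml)
    malformed = []
    for word in root.iter("w"):
        raw = word.get("x", "")
        parts = raw.split()  # split on any whitespace, ignoring trailing spaces
        ok = len(parts) == 4
        if ok:
            try:
                for part in parts:
                    float(part)
            except ValueError:
                ok = False
        if not ok:
            malformed.append((word.text, raw))
    return malformed


# Shortened sample based on the page from this issue.
sample = (
    '<ocr><p xml:id="sequence_1" wh="2480 3508"><b><l>'
    '<w x=".42 .097 .052 .007">DISEGNI</w> '
    '<w x=".406 .101 .007 ">–</w> '
    '<w x=".655 .097 .079 .007">DOCUMENTI</w>'
    '</l></b></p></ocr>'
)

for text, raw in find_malformed_words(sample):
    print("malformed coordinates for", repr(text), "->", repr(raw))
```

Run against the sample, the check reports only the "–" word, whose `x` attribute has three values instead of four.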

I'll add a check to the code that simply doesn't set the coordinates if this happens, so it'll at least keep working for the other words with the full coordinates :-)
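
The tolerant behaviour described above might look roughly like this. It is a Python sketch of the idea only, not the actual Java change in the plugin. `parse_box` is a hypothetical helper that returns `None` instead of raising when the coordinate string is incomplete, so the caller can keep the word's text and skip its box.

```python
from typing import Optional, Tuple


def parse_box(raw: str) -> Optional[Tuple[float, float, float, float]]:
    """Parse an 'x y width height' string; return None if it is malformed."""
    parts = raw.split()
    if len(parts) != 4:
        # Missing (or extra) values: leave the coordinates unset.
        return None
    try:
        x, y, w, h = (float(p) for p in parts)
    except ValueError:
        # A non-numeric value: also leave the coordinates unset.
        return None
    return (x, y, w, h)


print(parse_box(".655 .097 .079 .007"))  # complete box
print(parse_box(".406 .101 .007 "))      # height missing
```

Returning `None` rather than guessing a height keeps the other words on the page usable while making the missing box explicit to the caller.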

@jbaiter Great! Thanks a lot, and I'm happy it's not a bug in your plugin. I'll check this with @DiegoPino in our code.
Again, I'm sorry for taking up your time, so "Grazie mille" for your quick response. Have a nice day!

No worries, it wasn't a waste of time at all, problems like these are always a great opportunity to improve the error tolerance/reporting which is going to help other users down the road, so thank you! :-)