No blank between words with 0.6.0 compiled from main

Question

giancarlobi opened this issue 4 years ago · 25 comments

@jbaiter I compiled your plugin from main (resulting in a 0.6.0-SNAPSHOT) and installed it on Solr 8.8.1.
I found that words are indexed without spaces between them, like this:

I switched back to 0.5.0 without changing anything else, and the indexing is correct again:

Do you have any notes on this? Did I miss some new configuration parameters?
Thanks for your fantastic plugin.

Answer 1 · 2021-03-15T19:43:20.000Z

Thank you for the bug report!
Can you provide a sample page of your OCR? The way the markup is parsed has changed with the new version: we now use a proper XML parser instead of the previous state machine approach, so it's likely that I missed something.

Answer 2 · 2021-03-15T20:03:13.000Z

@jbaiter This is the content of the field in Solr after ingesting. Is this what you need?

<?xml version="1.0" encoding="UTF-8"?>
<ocr>
   <p xml:id="sequence_3" wh="2479 3509">
      <b>
         <l>
            <w x=".119 .045 .07 .011">Rapporto</w>
            <w x=".195 .045 .06 .01">Tecnico,</w>
            <w x=".262 .048 .056 .006">numero</w>
            <w x=".323 .045 .011 .011">3,</w>
            <w x=".34 .045 .052 .011">Agosto</w>
            <w x=".397 .044 .037 .009">2016</w>
         </l>
         <l>
            <w x=".134 .142 .106 .02">FABB</w>
            <w x=".255 .142 .182 .027">Repository</w>
            <w x=".45 .142 .049 .021">dal</w>
            <w x=".511 .145 .138 .024">progetto</w>
            <w x=".663 .142 .028 .021">al</w>
            <w x=".702 .142 .161 .027">prototipo.</w>
         </l>
         <l>
            <w x=".124 .177 .111 .02">Nuove</w>
            <w x=".246 .176 .099 .021">forme</w>
            <w x=".358 .176 .03 .021">di</w>
            <w x=".401 .176 .247 .025">conservazione,</w>
            <w x=".662 .176 .213 .021">condivisione</w>
         </l>
         <l>
            <w x=".226 .217 .017 .014">e</w>
            <w x=".255 .21 .243 .021">valorizzazione</w>
            <w x=".51 .21 .03 .021">di</w>
            <w x=".553 .217 .091 .02">opere</w>
            <w x=".657 .21 .117 .027">digitali</w>
         </l>
         <l>
            <w x=".263 .323 .095 .012">Giancarlo</w>
            <w x=".365 .323 .069 .014">Birello,</w>
            <w x=".442 .324 .052 .011">Ivano</w>
            <w x=".502 .323 .063 .014">Fucile,</w>
            <w x=".575 .323 .058 .012">Valter</w>
            <w x=".639 .323 .097 .011">Giovanetti</w>
         </l>
         <l>
            <w x=".411 .349 .093 .01">Ircres-CNR</w>
            <w x=".512 .349 .053 .013">Ufficio</w>
            <w x=".571 .349 .02 .009">IT</w>
         </l>
         <l>
            <w x=".418 .365 .046 .009">Strada</w>
            <w x=".47 .365 .035 .009">delle</w>
            <w x=".51 .365 .048 .011">Cacce,</w>
            <w x=".565 .365 .017 .009">73</w>
         </l>
         <l>
            <w x=".432 .38 .043 .009">10135</w>
            <w x=".481 .38 .05 .009">Torino</w>
            <w x=".537 .38 .033 .012">Italy</w>
         </l>
         <l>
            <w x=".44 .426 .05 .011">Anna</w>
            <w x=".497 .426 .062 .011">Perin*</w>
         </l>
         <l>
            <w x=".409 .452 .093 .01">Ircres-CNR</w>
            <w x=".508 .451 .083 .01">Biblioteca</w>
         </l>
         <l>
            <w x=".42 .468 .026 .009">Via</w>
            <w x=".451 .468 .033 .009">Real</w>
            <w x=".49 .468 .067 .012">Collegio,</w>
            <w x=".563 .468 .017 .009">30</w>
         </l>
         <l>
            <w x=".4 .483 .044 .009">10024</w>
            <w x=".449 .483 .08 .009">Moncalieri</w>
            <w x=".535 .483 .024 .009">TO</w>
            <w x=".564 .483 .033 .012">Italy</w>
         </l>
         <l>
            <w x=".119 .563 .109 .01">ABSTRACT:</w>
            <w x=".236 .563 .051 .01">FABB</w>
            <w x=".293 .563 .056 .013">project</w>
            <w x=".355 .563 .066 .012">(Famine</w>
            <w x=".428 .563 .028 .01">and</w>
            <w x=".462 .564 .046 .011">Feast,</w>
            <w x=".515 .564 .044 .01">Fame</w>
            <w x=".565 .567 .008 .007">e</w>
            <w x=".58 .563 .107 .012">Abbondanza)</w>
            <w x=".693 .563 .026 .01">has</w>
            <w x=".726 .563 .038 .01">been</w>
            <w x=".769 .563 .086 .01">committed</w>
            <w x=".861 .563 .02 .013">by</w>
         </l>
         <l>
            <w x=".119 .58 .093 .01">Fondazione</w>
            <w x=".221 .58 .042 .01">CRT.</w>
            <w x=".273 .58 .034 .01">This</w>
            <w x=".317 .58 .072 .01">technical</w>
            <w x=".397 .581 .048 .011">report</w>
            <w x=".453 .58 .067 .013">analyzes</w>
            <w x=".53 .58 .024 .01">the</w>
            <w x=".563 .58 .074 .013">strategies</w>
            <w x=".647 .58 .063 .013">adopted</w>
            <w x=".718 .58 .029 .01">and</w>
            <w x=".755 .58 .024 .01">the</w>
            <w x=".787 .58 .04 .01">main</w>
            <w x=".835 .583 .045 .01">open-</w>
         </l>
         <l>
            <w x=".12 .599 .051 .007">source</w>
            <w x=".182 .596 .068 .01">software</w>
            <w x=".259 .596 .041 .01">used.</w>
            <w x=".311 .596 .093 .01">Ircres-CNR</w>
            <w x=".414 .596 .026 .01">has</w>
            <w x=".451 .596 .073 .013">deployed</w>
            <w x=".534 .596 .024 .01">the</w>
            <w x=".569 .596 .068 .01">software</w>
            <w x=".647 .596 .028 .01">and</w>
            <w x=".686 .599 .048 .007">server</w>
            <w x=".743 .596 .076 .013">platforms</w>
            <w x=".83 .596 .017 .01">of</w>
            <w x=".856 .596 .024 .01">the</w>
         </l>
         <l>
            <w x=".119 .613 .086 .013">repository,</w>
            <w x=".216 .613 .015 .01">in</w>
            <w x=".242 .616 .008 .007">a</w>
            <w x=".261 .613 .086 .01">virtualized</w>
            <w x=".358 .613 .028 .01">and</w>
            <w x=".396 .613 .08 .01">redundant</w>
            <w x=".487 .613 .113 .012">infrastructure,</w>
            <w x=".611 .613 .011 .01">it</w>
            <w x=".633 .613 .031 .01">also</w>
            <w x=".675 .613 .033 .01">take</w>
            <w x=".718 .616 .033 .007">care</w>
            <w x=".762 .613 .017 .01">of</w>
            <w x=".789 .613 .024 .01">the</w>
            <w x=".823 .613 .056 .013">design,</w>
         </l>
         <l>
            <w x=".119 .629 .104 .013">development</w>
            <w x=".23 .629 .028 .01">and</w>
            <w x=".264 .63 .103 .011">management</w>
            <w x=".372 .629 .017 .01">of</w>
            <w x=".395 .629 .024 .01">the</w>
            <w x=".425 .629 .032 .01">web</w>
            <w x=".464 .629 .046 .013">portal</w>
            <w x=".517 .629 .087 .012">(front-end)</w>
            <w x=".611 .629 .023 .01">for</w>
            <w x=".64 .629 .024 .01">the</w>
            <w x=".67 .629 .102 .013">presentation,</w>
            <w x=".779 .629 .067 .01">research</w>
            <w x=".852 .629 .028 .01">and</w>
         </l>
         <l>
            <w x=".119 .645 .084 .013">consulting</w>
            <w x=".209 .645 .033 .01">data</w>
            <w x=".247 .645 .017 .01">of</w>
            <w x=".269 .645 .024 .01">the</w>
            <w x=".299 .645 .085 .013">digitalized</w>
            <w x=".389 .645 .042 .01">items</w>
            <w x=".438 .645 .054 .013">(lyrics,</w>
            <w x=".499 .645 .044 .013">lyrics</w>
            <w x=".549 .647 .034 .01">text,</w>
            <w x=".589 .645 .088 .012">interviews,</w>
            <w x=".683 .645 .052 .012">books,</w>
            <w x=".741 .645 .063 .013">poems).</w>
         </l>
         <l>
            <w x=".12 .695 .04 .009">KEY</w>
            <w x=".165 .694 .077 .01">WORDS:</w>
            <w x=".25 .698 .102 .01">open-source,</w>
            <w x=".358 .694 .078 .012">islandora,</w>
            <w x=".442 .694 .086 .013">repository,</w>
            <w x=".534 .694 .05 .013">digital</w>
            <w x=".591 .694 .063 .012">archive,</w>
            <w x=".66 .694 .061 .01">cultural</w>
            <w x=".726 .694 .064 .013">heritage</w>
         </l>
         <l>
            <w x=".119 .744 .033 .01">JEL</w>
            <w x=".157 .744 .069 .01">CODES:</w>
            <w x=".234 .744 .03 .01">Z11</w>
         </l>
         <l>
            <w x=".119 .864 .202 .001">____________________</w>
         </l>
         <l>
            <w x=".12 .886 .119 .013">*Corresponding</w>
            <w x=".244 .887 .05 .009">author:</w>
            <w x=".302 .887 .178 .012">anna.perin@ircres.cnr.it</w>
         </l>
      </b>
   </p>
</ocr>

Thanks !!!

Answer 3 · 2021-03-15T20:19:14.000Z

Thank you, that helps a lot :-)
It's likely a bug in the way implicit whitespace is handled when dealing with MiniOCR, will provide a fix tomorrow!

Answer 4 · 2021-03-15T21:34:10.000Z

@jbaiter Great, that was quick! I'm available to check the fix in our production deployment. Thanks.

Answer 5 · 2021-03-15T22:29:52.000Z

@jbaiter thanks so much. You are just awesome 🥇

Answer 6 · 2021-03-16T09:17:59.000Z

So I just built a test case with the provided page, and for some reason I can't seem to reproduce the problem. For example, here's the snippet I get for the query "consulting data of the digitalized items":

<lst>
          <str name="text">repository, in a virtualized and redundant infrastructure, it also take care of the design, development and management of the web portal (front-end) for the presentation, research and &lt;em&gt;consulting data of the digitalized items&lt;/em&gt; (lyrics, lyrics text, interviews, books, poems). KEY WORDS: open-source, islandora, repository, digital archive, cultural heritage JEL CODES: Z11</str>
          <float name="score">1490.7888</float>
          <arr name="pages">
            <lst>
              <str name="id">sequence_3</str>
              <int name="width">2479</int>
              <int name="height">3509</int>
            </lst>
          </arr>
          <arr name="regions">
            <lst>
              <float name="ulx">0.119</float>
              <float name="uly">0.613</float>
              <float name="lrx">0.88</float>
              <float name="lry">0.754</float>
              <str name="text">repository, in a virtualized and redundant infrastructure, it also take care of the design, development and management of the web portal (front-end) for the presentation, research and &lt;em&gt;consulting data of the digitalized items&lt;/em&gt; (lyrics, lyrics text, interviews, books, poems). KEY WORDS: open-source, islandora, repository, digital archive, cultural heritage JEL CODES: Z11</str>
              <int name="pageIdx">0</int>
            </lst>
          </arr>
          <arr name="highlights">
            <arr>
              <lst>
                <int name="ulx">0</int>
                <float name="uly">0.2269</float>
                <float name="lrx">0.4099</float>
                <float name="lry">0.3191</float>
                <str name="text">consulting data of the digitalized items</str>
                <int name="parentRegionIdx">0</int>
              </lst>
            </arr>
          </arr>
        </lst>

This tells me that the whitespace-handling from the OCR parser is correct for this file, since we find a match for the phrase.

Can you show how your index analysis pipeline is configured? I suspect that this is related to how the tokenizer is configured.

I just noticed that the same schema works with 0.5.0, so this is really something that is in the plugin. Can you please provide a sample page for which you are certain that the problem is happening? E.g. a page where one of the terms/term sequences from your screenshot is occurring.

P.S.: If you're using MiniOCR to save on index space, you're leaving a few bytes on the table by not stripping the extraneous whitespace :-) The only whitespace that is needed is the one between the individual words, everything else is ignored anyway and takes up precious space. In your case, "minifying" the file would result in saving ~20% of the file size (7.1KiB vs 5.8KiB uncompressed). The practical impact is likely to be a lot smaller, though, since Lucene compresses segments with LZ4, but it's maybe something you might want to benchmark if space is of consideration for you.
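
A minimal sketch of such a "minification" step, in Python. The function name `minify_miniocr` is hypothetical and not part of the plugin. It assumes that whitespace between structural tags can be dropped and, as a conservative choice, keeps a single space after every closing `</w>`, including the last word of a line. It also assumes that word text contains no literal `>` characters.

```python
import re

# Matches either a closing </w> tag or any other tag end (">"),
# followed by a run of whitespace that sits directly before the next tag.
_BETWEEN_TAGS = re.compile(r"(</w>|>)\s+(?=<)")


def minify_miniocr(markup: str) -> str:
    """Drop inter-tag whitespace, keeping one space after each </w>."""

    def _replace(match: re.Match) -> str:
        tag_end = match.group(1)
        # Whitespace after a word is significant for tokenization: keep one space.
        # Whitespace after any other tag is only indentation: drop it.
        return tag_end + (" " if tag_end == "</w>" else "")

    return _BETWEEN_TAGS.sub(_replace, markup)


sample = (
    "<ocr>\n"
    "  <p>\n"
    "    <l>\n"
    '      <w x="1">Rapporto</w>\n'
    '      <w x="2">Tecnico,</w>\n'
    "    </l>\n"
    "  </p>\n"
    "</ocr>"
)
print(minify_miniocr(sample))
```

The important design point is that the spaces after the words survive: removing them as well is exactly what makes adjacent words run together.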

Answer 7 · 2021-03-16T09:54:47.000Z

@jbaiter Thanks a lot, I've attached the page exactly as indexed in the screenshot above. Is this what you need?
I will check the whole chain from page to MiniOCR and report back here; @DiegoPino can also add more details about this. Thanks!
pg_0003.pdf

Answer 8 · 2021-03-16T10:25:30.000Z

@jbaiter When the PDF is searchable:

1. We use djvu2hocr to convert a single page to hOCR.
2. Since the djvu2hocr output uses ocrx_line while Tesseract uses ocr_line, we replace ocrx_line with ocr_line.
3. Finally, we call this function to convert to MiniOCR: https://github.com/esmero/strawberry_runners/blob/9e6fc38e0f4c48e6a84bef5214b519f776b877b6/src/Plugin/StrawberryRunnersPostProcessor/OcrPostProcessor.php#L434

Thanks again

Answer 9 · 2021-03-16T10:29:52.000Z

Sorry, I had a slight misunderstanding, I just noticed that the rapportotecnico comes from the MiniOCR you posted, sorry!
This is very odd behavior, the MiniOCR looks fine, and I don't have any problems with bad tokenization in the unit test.
Could you maybe post your index analysis chain after all? Maybe there's some weird interplay with the tokenizer you're using (the test on my end uses the StandardTokenizerFactory)

Answer 10 · 2021-03-16T10:43:46.000Z

Do you mean this?:

<fieldType name="text_ocr_stored" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
  <analyzer type="index">
    <charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Answer 11 · 2021-03-16T10:48:38.000Z

And are you using
<tokenizer class="solr.StandardTokenizerFactory"/>
instead of
<tokenizer class="solr.WhitespaceTokenizerFactory"/> ?

Answer 12 · 2021-03-16T10:49:12.000Z

Yes, exactly, thank you! Any reason you're using the WhitespaceTokenizerFactory? This is usually only intended for highly structured content like keyword lists or similar things.
For natural language you'll probably want to use something else that is a bit smarter about things like punctuation.

Answer 13 · 2021-03-16T11:08:42.000Z

@jbaiter Double thanks! I'll check it in the next few hours. I really don't remember why we are using WhitespaceTokenizerFactory; @DiegoPino could add more info about this. In any case, we have to look more closely at which tokenizer is right for us, e.g. I see that I also have to remove punctuation.

Answer 14 · 2021-03-16T11:09:39.000Z

If you use something like the StandardTokenizer, it will remove punctuation for you as part of the tokenization process :-)

Answer 15 · 2021-03-16T11:33:03.000Z

@jbaiter I checked, but I have the same issue with StandardTokenizer too (plugin 0.6.0):

Could it depend on how the MiniOCR is formatted? Any idea what else to check? Thanks.

Answer 16 · 2021-03-16T11:40:42.000Z

I just found the issue, I mainly tested the new parser with external OCR sources, but in your case you're loading the OCR from the index itself! Will investigate and get back to you as soon as I've found a fix :-)

Nope :-(

Answer 17 · 2021-03-16T17:03:30.000Z

Sorry, I was on the wrong trail this morning; it does not have to do with the external/stored state after all :-/ Could you do me a favor and paste the exact string value that you get back when you retrieve the "numero 3" document from the index? I.e. the one you get from GET /solr/<collection>/select?q=id:<id>&fl=text_ocr_stored
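
A request of this kind can be built with the standard Solr select parameters `q` and `fl`. The sketch below only assembles the URL. The helper name `stored_ocr_url`, the base URL `http://localhost:8983`, the collection name `mycoll` and the assumption that the unique key field is called `id` are all illustrative, not taken from the deployment discussed here.

```python
from urllib.parse import quote, urlencode


def stored_ocr_url(base_url: str, collection: str, doc_id: str,
                   field: str = "text_ocr_stored") -> str:
    """Build a select URL returning only the stored OCR field of one document."""
    params = urlencode({
        # Select the document by its unique key (assumed to be "id").
        # A doc_id containing double quotes would need extra escaping.
        "q": f'id:"{doc_id}"',
        # Restrict the response to the stored OCR field.
        "fl": field,
        "wt": "json",
    })
    return f"{base_url}/solr/{quote(collection, safe='')}/select?{params}"


print(stored_ocr_url("http://localhost:8983", "mycoll", "doc1"))
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the stored value in the response is exactly what was posted at index time.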

Answer 18 · 2021-03-16T17:06:17.000Z

@jbaiter I switched back to 0.5.0, does it matter?

Answer 19 · 2021-03-16T17:06:44.000Z

I can extract both if needed.

Answer 20 · 2021-03-16T17:08:09.000Z

No, it shouldn't matter :-) Since you're storing the OCR in the index, the actual stored value is just whatever you posted to the collection when you indexed the document. The plugin version only plays a role afterwards, when the plugin indexes the OCR or highlights it. I want to make sure that the actual OCR that is stored in the index doesn't have any whitespace issues.

Answer 21 · 2021-03-16T17:16:34.000Z

Here:

{
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "tcocr_highlightm_X3b_und_ocr_text":["<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<ocr><p xml:id=\"sequence_3\" wh=\"2479 3509\"><b><l><w x=\".119 .045 .07 .011\">Rapporto</w><w x=\".195 .045 .06 .01\">Tecnico,</w><w x=\".262 .048 .056 .006\">numero</w><w x=\".323 .045 .011 .011\">3,</w><w x=\".34 .045 .052 .011\">Agosto</w><w x=\".397 .044 .037 .009\">2016</w></l><l><w x=\".134 .142 .106 .02\">FABB</w><w x=\".255 .142 .182 .027\">Repository</w><w x=\".45 .142 .049 .021\">dal</w><w x=\".511 .145 .138 .024\">progetto</w><w x=\".663 .142 .028 .021\">al</w><w x=\".702 .142 .161 .027\">prototipo.</w></l><l><w x=\".124 .177 .111 .02\">Nuove</w><w x=\".246 .176 .099 .021\">forme</w><w x=\".358 .176 .03 .021\">di</w><w x=\".401 .176 .247 .025\">conservazione,</w><w x=\".662 .176 .213 .021\">condivisione</w></l><l><w x=\".226 .217 .017 .014\">e</w><w x=\".255 .21 .243 .021\">valorizzazione</w><w x=\".51 .21 .03 .021\">di</w><w x=\".553 .217 .091 .02\">opere</w><w x=\".657 .21 .117 .027\">digitali</w></l><l><w x=\".263 .323 .095 .012\">Giancarlo</w><w x=\".365 .323 .069 .014\">Birello,</w><w x=\".442 .324 .052 .011\">Ivano</w><w x=\".502 .323 .063 .014\">Fucile,</w><w x=\".575 .323 .058 .012\">Valter</w><w x=\".639 .323 .097 .011\">Giovanetti</w></l><l><w x=\".411 .349 .093 .01\">Ircres-CNR</w><w x=\".512 .349 .053 .013\">Ufficio</w><w x=\".571 .349 .02 .009\">IT</w></l><l><w x=\".418 .365 .046 .009\">Strada</w><w x=\".47 .365 .035 .009\">delle</w><w x=\".51 .365 .048 .011\">Cacce,</w><w x=\".565 .365 .017 .009\">73</w></l><l><w x=\".432 .38 .043 .009\">10135</w><w x=\".481 .38 .05 .009\">Torino</w><w x=\".537 .38 .033 .012\">Italy</w></l><l><w x=\".44 .426 .05 .011\">Anna</w><w x=\".497 .426 .062 .011\">Perin*</w></l><l><w x=\".409 .452 .093 .01\">Ircres-CNR</w><w x=\".508 .451 .083 .01\">Biblioteca</w></l><l><w x=\".42 .468 .026 .009\">Via</w><w x=\".451 .468 .033 .009\">Real</w><w x=\".49 .468 .067 .012\">Collegio,</w><w x=\".563 .468 .017 .009\">30</w></l><l><w 
x=\".4 .483 .044 .009\">10024</w><w x=\".449 .483 .08 .009\">Moncalieri</w><w x=\".535 .483 .024 .009\">TO</w><w x=\".564 .483 .033 .012\">Italy</w></l><l><w x=\".119 .563 .109 .01\">ABSTRACT:</w><w x=\".236 .563 .051 .01\">FABB</w><w x=\".293 .563 .056 .013\">project</w><w x=\".355 .563 .066 .012\">(Famine</w><w x=\".428 .563 .028 .01\">and</w><w x=\".462 .564 .046 .011\">Feast,</w><w x=\".515 .564 .044 .01\">Fame</w><w x=\".565 .567 .008 .007\">e</w><w x=\".58 .563 .107 .012\">Abbondanza)</w><w x=\".693 .563 .026 .01\">has</w><w x=\".726 .563 .038 .01\">been</w><w x=\".769 .563 .086 .01\">committed</w><w x=\".861 .563 .02 .013\">by</w></l><l><w x=\".119 .58 .093 .01\">Fondazione</w><w x=\".221 .58 .042 .01\">CRT.</w><w x=\".273 .58 .034 .01\">This</w><w x=\".317 .58 .072 .01\">technical</w><w x=\".397 .581 .048 .011\">report</w><w x=\".453 .58 .067 .013\">analyzes</w><w x=\".53 .58 .024 .01\">the</w><w x=\".563 .58 .074 .013\">strategies</w><w x=\".647 .58 .063 .013\">adopted</w><w x=\".718 .58 .029 .01\">and</w><w x=\".755 .58 .024 .01\">the</w><w x=\".787 .58 .04 .01\">main</w><w x=\".835 .583 .045 .01\">open-</w></l><l><w x=\".12 .599 .051 .007\">source</w><w x=\".182 .596 .068 .01\">software</w><w x=\".259 .596 .041 .01\">used.</w><w x=\".311 .596 .093 .01\">Ircres-CNR</w><w x=\".414 .596 .026 .01\">has</w><w x=\".451 .596 .073 .013\">deployed</w><w x=\".534 .596 .024 .01\">the</w><w x=\".569 .596 .068 .01\">software</w><w x=\".647 .596 .028 .01\">and</w><w x=\".686 .599 .048 .007\">server</w><w x=\".743 .596 .076 .013\">platforms</w><w x=\".83 .596 .017 .01\">of</w><w x=\".856 .596 .024 .01\">the</w></l><l><w x=\".119 .613 .086 .013\">repository,</w><w x=\".216 .613 .015 .01\">in</w><w x=\".242 .616 .008 .007\">a</w><w x=\".261 .613 .086 .01\">virtualized</w><w x=\".358 .613 .028 .01\">and</w><w x=\".396 .613 .08 .01\">redundant</w><w x=\".487 .613 .113 .012\">infrastructure,</w><w x=\".611 .613 .011 .01\">it</w><w x=\".633 .613 .031 .01\">also</w><w 
x=\".675 .613 .033 .01\">take</w><w x=\".718 .616 .033 .007\">care</w><w x=\".762 .613 .017 .01\">of</w><w x=\".789 .613 .024 .01\">the</w><w x=\".823 .613 .056 .013\">design,</w></l><l><w x=\".119 .629 .104 .013\">development</w><w x=\".23 .629 .028 .01\">and</w><w x=\".264 .63 .103 .011\">management</w><w x=\".372 .629 .017 .01\">of</w><w x=\".395 .629 .024 .01\">the</w><w x=\".425 .629 .032 .01\">web</w><w x=\".464 .629 .046 .013\">portal</w><w x=\".517 .629 .087 .012\">(front-end)</w><w x=\".611 .629 .023 .01\">for</w><w x=\".64 .629 .024 .01\">the</w><w x=\".67 .629 .102 .013\">presentation,</w><w x=\".779 .629 .067 .01\">research</w><w x=\".852 .629 .028 .01\">and</w></l><l><w x=\".119 .645 .084 .013\">consulting</w><w x=\".209 .645 .033 .01\">data</w><w x=\".247 .645 .017 .01\">of</w><w x=\".269 .645 .024 .01\">the</w><w x=\".299 .645 .085 .013\">digitalized</w><w x=\".389 .645 .042 .01\">items</w><w x=\".438 .645 .054 .013\">(lyrics,</w><w x=\".499 .645 .044 .013\">lyrics</w><w x=\".549 .647 .034 .01\">text,</w><w x=\".589 .645 .088 .012\">interviews,</w><w x=\".683 .645 .052 .012\">books,</w><w x=\".741 .645 .063 .013\">poems).</w></l><l><w x=\".12 .695 .04 .009\">KEY</w><w x=\".165 .694 .077 .01\">WORDS:</w><w x=\".25 .698 .102 .01\">open-source,</w><w x=\".358 .694 .078 .012\">islandora,</w><w x=\".442 .694 .086 .013\">repository,</w><w x=\".534 .694 .05 .013\">digital</w><w x=\".591 .694 .063 .012\">archive,</w><w x=\".66 .694 .061 .01\">cultural</w><w x=\".726 .694 .064 .013\">heritage</w></l><l><w x=\".119 .744 .033 .01\">JEL</w><w x=\".157 .744 .069 .01\">CODES:</w><w x=\".234 .744 .03 .01\">Z11</w></l><l><w x=\".119 .864 .202 .001\">____________________</w></l><l><w x=\".12 .886 .119 .013\">*Corresponding</w><w x=\".244 .887 .05 .009\">author:</w><w x=\".302 .887 .178 .012\">anna.perin@ircres.cnr.it</w></l></b></p></ocr>"]}]
  }}

Answer 22 · 2021-03-16T17:19:56.000Z

There you go, the OCR that you feed to the index does not have any whitespace between the words!
The plugin relies on the whitespace in the OCR when parsing it, i.e. <w ...>hello</w><w>world</w> will parse to helloworld. Make sure you don't throw away the whitespace between the ocrx_word spans that you get back from djvu2hocr.
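
On the producing side, the fix amounts to emitting a space between consecutive word elements. The following Python sketch illustrates this; `miniocr_line` and `plain_text` are hypothetical helpers, and `plain_text` is only a rough stand-in for how a whitespace-preserving parser sees the markup, not the plugin's actual parser.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def miniocr_line(words):
    """Serialize (text, box) pairs as one MiniOCR line, words separated by a space."""
    parts = [f"<w x={quoteattr(box)}>{escape(text)}</w>" for text, box in words]
    # Joining with " " instead of "" is the whole point of this sketch.
    return "<l>" + " ".join(parts) + "</l>"


def plain_text(markup):
    """Concatenate all text nodes and collapse whitespace runs to one space."""
    text = "".join(ET.fromstring(markup).itertext())
    return re.sub(r"\s+", " ", text).strip()


words = [("Rapporto", ".119 .045 .07 .011"), ("Tecnico,", ".195 .045 .06 .01")]
good = miniocr_line(words)
bad = good.replace("</w> <w", "</w><w")  # simulate the lost whitespace

print(plain_text(good))
print(plain_text(bad))
```

With the space in place the two words stay separate tokens; without it they collapse into a single string, which matches the symptom reported in this thread.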

Answer 23 · 2021-03-16T17:22:11.000Z

@jbaiter One last question (I hope): why does that happen with 0.6.0 and not with 0.5.0? Anyway, thanks a lot.

Answer 24 · 2021-03-16T17:32:40.000Z

Good question! The 0.5.0 code wrapped Lucene's HTMLStripCharFilter. This filter outputs a lot of extra whitespace/newlines between node texts.

For example, this is what your whitespace-less document looked like after being run through the HTMLStripCharFilter:






Rapporto

Tecnico,

numero

3,

Agosto

2016



FABB

Repository

dal

progetto

al

prototipo.



Nuove

forme

di

conservazione,

condivisione



e

valorizzazione

di

opere

digitali



Giancarlo

Birello,

Ivano

Fucile,

Valter

Giovanetti



Ircres-CNR

Ufficio

IT



Strada

delle

Cacce,

73



10135

Torino

Italy



Anna

Perin*



Ircres-CNR

Biblioteca



Via

Real

Collegio,

30



10024

Moncalieri

TO

Italy



ABSTRACT:

FABB

project

(Famine

and

Feast,

Fame

e

Abbondanza)

has

been

committed

by



Fondazione

CRT.

This

technical

report

analyzes

the

strategies

adopted

and

the

main

open-



source

software

used.

Ircres-CNR

has

deployed

the

software

and

server

platforms

of

the



repository,

in

a

virtualized

and

redundant

infrastructure,

it

also

take

care

of

the

design,



development

and

management

of

the

web

portal

(front-end)

for

the

presentation,

research

and



consulting

data

of

the

digitalized

items

(lyrics,

lyrics

text,

interviews,

books,

poems).



KEY

WORDS:

open-source,

islandora,

repository,

digital

archive,

cultural

heritage



JEL

CODES:

Z11



____________________



*Corresponding

author:

anna.perin@ircres.cnr.it

The new parser only outputs whatever whitespace there is in the input document (and normalizes runs of consecutive spaces to a single space character to deal with indentation). If there is no whitespace in the input document, the parsed text will not have any whitespace either.
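
The difference between the two behaviors can be imitated with two small Python functions. This is only an illustration under simplifying assumptions: the regex-based tag stripping and the "every tag becomes a newline" rule are stand-ins chosen to reproduce the effect described above, not the actual code of HTMLStripCharFilter or of the plugin's parser.

```python
import re

TAG = re.compile(r"<[^>]+>")


def strip_tags_with_newlines(markup: str) -> str:
    """Imitates the 0.5.0 effect: each tag is replaced by a newline."""
    return TAG.sub("\n", markup)


def strip_tags_keep_input_whitespace(markup: str) -> str:
    """Imitates the 0.6.0 effect: tags vanish, whitespace runs become one space."""
    return re.sub(r"\s+", " ", TAG.sub("", markup))


no_space = '<l><w x="1">Rapporto</w><w x="2">Tecnico,</w></l>'
with_space = '<l><w x="1">Rapporto</w> <w x="2">Tecnico,</w></l>'

# Old style: the tags themselves introduce separators, so words stay apart.
print(strip_tags_with_newlines(no_space).split())
# New style: only whitespace present in the input separates words.
print(strip_tags_keep_input_whitespace(no_space).split())
print(strip_tags_keep_input_whitespace(with_space).split())
```

In other words, the old behavior masked the missing whitespace in the input, while the new one exposes it.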

Answer 25 · 2021-03-16T17:38:15.000Z

Thanks a lot for your time on this, have a nice evening!!! Keep in mind that there is a really good bottle of wine waiting for you here when you come to Italy!!