dbmdz/solr-ocrhighlighting

Can't index hOCR documents on Windows

petr-fleischmann opened this issue · 4 comments

Some hOCR files can't be parsed (version 0.6.0) because their content contains characters with diacritics. For example, the characters "ůá" in the words: aráme, ků
Example hOCR file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract v4.0.0.20181030' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='bbox 0 0 2488 3510; ppageno 0'>
   <div class='ocr_carea' id='block_1_4' title="bbox 2407 1654 2482 3505">
    <p class='ocr_par' id='par_1_4' lang='ces' title="bbox 2407 1654 2482 3505">
     <span class='ocr_line' id='line_1_4' title="bbox 2407 1654 2482 3505; textangle 90; x_size 35; x_descenders 7; x_ascenders 12">
      <span class='ocrx_word' id='word_1_13' title='bbox 2447 2681 2463 2757; x_wconf 0'>aráme</span>
      <span class='ocrx_word' id='word_1_15' title='bbox 2420 2481 2462 2530; x_wconf 66'>ků</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>

Indexing it throws this error:

2021-06-11 08:42:33.557 ERROR (qtp1516500233-30) [   x:testOCR] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id ocrdoc-79 to the index; possible analysis error.
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:251)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:289)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:223)
	at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
	at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:507)
	at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:145)
	at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:121)
	at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:84)
	at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2578)
	at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:780)
	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:566)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:423)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:350)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.Server.handle(Server.java:505)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
	at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Failed to parse the OCR markup, make sure your files are well-formed and your regions start/end on complete tags! (Source was: c:/OCR/MNB_006_045/2ff558170f3aea11a96000155d02ad02.hocr)
	at de.digitalcollections.solrocr.formats.OcrParser.next(OcrParser.java:144)
	at de.digitalcollections.solrocr.lucene.filters.OcrCharFilter.readNextWord(OcrCharFilter.java:29)
	at de.digitalcollections.solrocr.lucene.filters.OcrCharFilter.read(OcrCharFilter.java:125)
	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:675)
	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:898)
	at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:148)
	at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:41)
	at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:49)
	at org.apache.lucene.analysis.en.PorterStemFilter.incrementToken(PorterStemFilter.java:67)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:812)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:442)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:406)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
	at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:964)
	at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:342)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:289)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:236)
	... 51 more
Caused by: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxException: Reader (of type com.ctc.wstx.io.MergedReader) returned 0 characters, even when asked to read up to 4000
 at [row,col {unknown-source}]: [1,1]
	at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
	at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:728)
	at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3678)
	at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:860)
	at de.digitalcollections.solrocr.formats.hocr.HocrParser.seekToNextWord(HocrParser.java:264)
	at de.digitalcollections.solrocr.formats.hocr.HocrParser.readNext(HocrParser.java:75)
	at de.digitalcollections.solrocr.formats.OcrParser.next(OcrParser.java:140)
	... 70 more
Caused by: com.ctc.wstx.exc.WstxException: Reader (of type com.ctc.wstx.io.MergedReader) returned 0 characters, even when asked to read up to 4000
 at [row,col {unknown-source}]: [1,1]
	at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:98)
	at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:56)
	at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1001)
	at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4647)
	at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4146)
	at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3720)
	at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3676)
	... 74 more
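
The exception message asks to make sure the file is well-formed. One way to check that independently of the plugin is to run the file through a standard XML parser. The following is a minimal sketch, not part of the plugin: the helper name `hocr_words` and the trimmed sample are hypothetical, and it only shows that hOCR words containing "á" and "ů" parse cleanly as UTF-8 XML.

```python
import xml.etree.ElementTree as ET
from typing import List

# hOCR produced by tesseract is XHTML, so elements live in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def hocr_words(data: bytes) -> List[str]:
    """Return the text of every ocrx_word span in the given hOCR bytes.

    Raises xml.etree.ElementTree.ParseError if the input is not
    well-formed XML. (Hypothetical helper, for illustration only.)
    """
    root = ET.fromstring(data)
    return [
        "".join(span.itertext())
        for span in root.iter(XHTML_NS + "span")
        if span.get("class") == "ocrx_word"
    ]


# Trimmed-down version of the failing sample, encoded as UTF-8.
# \u00e1 is "á" and \u016f is "ů".
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <span class='ocrx_word' id='word_1_13'>ar\u00e1me</span>
  <span class='ocrx_word' id='word_1_15'>k\u016f</span>
 </body>
</html>""".encode("utf-8")

if __name__ == "__main__":
    print(hocr_words(SAMPLE))
```

To check a file on disk, read it in binary mode and pass the bytes to `hocr_words`. If this succeeds, the markup itself is well-formed and the failure is more likely to come from how the file is read than from its content.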

The same hOCR without the diacritics "ů, á" is OK:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract v4.0.0.20181030' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='bbox 0 0 2488 3510; ppageno 0'>
   <div class='ocr_carea' id='block_1_4' title="bbox 2407 1654 2482 3505">
    <p class='ocr_par' id='par_1_4' lang='ces' title="bbox 2407 1654 2482 3505">
     <span class='ocr_line' id='line_1_4' title="bbox 2407 1654 2482 3505; textangle 90; x_size 35; x_descenders 7; x_ascenders 12">
      <span class='ocrx_word' id='word_1_13' title='bbox 2447 2681 2463 2757; x_wconf 0'>arme</span>
      <span class='ocrx_word' id='word_1_15' title='bbox 2420 2481 2462 2530; x_wconf 66'>k</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>
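
One property that differs between the two samples is that "á" and "ů" each take two bytes in UTF-8, so the character count and the byte count of the failing file no longer match. The sketch below only illustrates that difference. It does not establish that this is the cause of the error, and the helper name `char_and_byte_len` is hypothetical.

```python
from typing import Tuple


def char_and_byte_len(word: str) -> Tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes) for a word."""
    return len(word), len(word.encode("utf-8"))


if __name__ == "__main__":
    # Escapes are used so the precomposed code points are unambiguous:
    # \u00e1 is "á", \u016f is "ů".
    for word in ["ar\u00e1me", "arme", "k\u016f", "k"]:
        chars, nbytes = char_and_byte_len(word)
        print(f"{word!r}: {chars} chars, {nbytes} UTF-8 bytes")
```

For the words without diacritics both numbers are equal, while for "aráme" and "ků" the byte count is one higher than the character count.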

Thank you for the detailed bug report. This should be enough to pinpoint the cause of the bug and hopefully find a fix. I'll report back once I've gotten around to probing it. That might be a while, since I'm currently on parental leave, i.e. it will happen when the little one has had a good night and I'm not too swamped with household stuff (-:

So I just tried to reproduce the issue with the example document from the OP, but it indexes just fine for me 🤔

Can you share the file that causes the issue? I.e. the actual c:/OCR/MNB_006_045/2ff558170f3aea11a96000155d02ad02.hocr file on disk.

Also, could you try running the same setup with the same data inside a Docker container with a Linux system? The plugin was only tested on Linux and uses a few low-level interfaces that might behave differently on Windows systems, so it would be good to verify whether this is the case.

Thanks for the quick reply

My results are:

| OS | Solr version | Plugin version | Status |
|---|---|---|---|
| Windows 10 | 8.2 | 0.6.0 | NOK |
| Windows 10 | 8.2 | 0.5.0 | OK |
| Windows 10 (WSL2 Ubuntu + Docker -> Linux) | 8.7 | 0.6.0 | OK |

I think you're right. The problem seems to be specific to Windows (with version 0.6.0).

2ff558170f3aea11a96000155d02ad02.zip

Thank you! I'll try setting up a Windows environment to reproduce and hopefully fix the issue. It might take a bit, though 😬