Can't index hOCR documents on Windows
petr-fleischmann opened this issue · 4 comments
Some hOCR files can't be parsed (version 0.6.0) because their content contains characters with diacritics, for example "ů" and "á" in the words "aráme" and "ků".
Example hOCR file:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name='ocr-system' content='tesseract v4.0.0.20181030' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='bbox 0 0 2488 3510; ppageno 0'>
<div class='ocr_carea' id='block_1_4' title="bbox 2407 1654 2482 3505">
<p class='ocr_par' id='par_1_4' lang='ces' title="bbox 2407 1654 2482 3505">
<span class='ocr_line' id='line_1_4' title="bbox 2407 1654 2482 3505; textangle 90; x_size 35; x_descenders 7; x_ascenders 12">
<span class='ocrx_word' id='word_1_13' title='bbox 2447 2681 2463 2757; x_wconf 0'>aráme</span>
<span class='ocrx_word' id='word_1_15' title='bbox 2420 2481 2462 2530; x_wconf 66'>ků</span>
</span>
</p>
</div>
</div>
</body>
</html>
Indexing it throws this error:
2021-06-11 08:42:33.557 ERROR (qtp1516500233-30) [ x:testOCR] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id ocrdoc-79 to the index; possible analysis error.
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:251)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:289)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:223)
at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:507)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:145)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:121)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:84)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2578)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:780)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:566)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:423)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:350)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:505)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Failed to parse the OCR markup, make sure your files are well-formed and your regions start/end on complete tags! (Source was: c:/OCR/MNB_006_045/2ff558170f3aea11a96000155d02ad02.hocr)
at de.digitalcollections.solrocr.formats.OcrParser.next(OcrParser.java:144)
at de.digitalcollections.solrocr.lucene.filters.OcrCharFilter.readNextWord(OcrCharFilter.java:29)
at de.digitalcollections.solrocr.lucene.filters.OcrCharFilter.read(OcrCharFilter.java:125)
at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:675)
at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:898)
at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:148)
at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:41)
at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:49)
at org.apache.lucene.analysis.en.PorterStemFilter.incrementToken(PorterStemFilter.java:67)
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:812)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:442)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:406)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:964)
at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:342)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:289)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:236)
... 51 more
Caused by: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxException: Reader (of type com.ctc.wstx.io.MergedReader) returned 0 characters, even when asked to read up to 4000
at [row,col {unknown-source}]: [1,1]
at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:728)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3678)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:860)
at de.digitalcollections.solrocr.formats.hocr.HocrParser.seekToNextWord(HocrParser.java:264)
at de.digitalcollections.solrocr.formats.hocr.HocrParser.readNext(HocrParser.java:75)
at de.digitalcollections.solrocr.formats.OcrParser.next(OcrParser.java:140)
... 70 more
Caused by: com.ctc.wstx.exc.WstxException: Reader (of type com.ctc.wstx.io.MergedReader) returned 0 characters, even when asked to read up to 4000
at [row,col {unknown-source}]: [1,1]
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:98)
at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:56)
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1001)
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4647)
at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4146)
at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3720)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3676)
... 74 more
The same hOCR without the diacritics "ů" and "á" is indexed without problems:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name='ocr-system' content='tesseract v4.0.0.20181030' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='bbox 0 0 2488 3510; ppageno 0'>
<div class='ocr_carea' id='block_1_4' title="bbox 2407 1654 2482 3505">
<p class='ocr_par' id='par_1_4' lang='ces' title="bbox 2407 1654 2482 3505">
<span class='ocr_line' id='line_1_4' title="bbox 2407 1654 2482 3505; textangle 90; x_size 35; x_descenders 7; x_ascenders 12">
<span class='ocrx_word' id='word_1_13' title='bbox 2447 2681 2463 2757; x_wconf 0'>arme</span>
<span class='ocrx_word' id='word_1_15' title='bbox 2420 2481 2462 2530; x_wconf 66'>k</span>
</span>
</p>
</div>
</div>
</body>
</html>
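To help rule out a malformed file, the following is a minimal Python sketch that parses a trimmed-down copy of the failing example with a standard XML parser and lists the words that contain non-ASCII characters. It runs outside Solr and does not use the plugin's parser. The names `SAMPLE_HOCR` and `non_ascii_words` are made up for this illustration. It only shows that the markup is well-formed and that "aráme" and "ků" take more bytes than characters in UTF-8; whether that difference is related to the failure is an assumption, not something established in this report.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Trimmed-down copy of the failing example (hypothetical test fixture).
# The escapes \u00e1 and \u016f stand for the characters "á" and "ů".
SAMPLE_HOCR = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<body>
<div class='ocr_page' id='page_1' title='bbox 0 0 2488 3510; ppageno 0'>
<span class='ocrx_word' id='word_1_13' title='bbox 2447 2681 2463 2757; x_wconf 0'>ar\u00e1me</span>
<span class='ocrx_word' id='word_1_15' title='bbox 2420 2481 2462 2530; x_wconf 66'>k\u016f</span>
</div>
</body>
</html>
""".encode("utf-8")


def non_ascii_words(hocr_bytes):
    """Return (word, char_count, utf8_byte_count) for each ocrx_word
    whose text contains at least one non-ASCII character."""
    # ET.fromstring raises xml.etree.ElementTree.ParseError
    # if the markup is not well-formed.
    root = ET.fromstring(hocr_bytes)
    found = []
    for span in root.iter(XHTML_NS + "span"):
        if span.get("class") != "ocrx_word":
            continue
        word = (span.text or "").strip()
        if not word.isascii():
            found.append((word, len(word), len(word.encode("utf-8"))))
    return found


if __name__ == "__main__":
    # ascii() keeps the output printable on consoles with a legacy code page.
    for word, chars, nbytes in non_ascii_words(SAMPLE_HOCR):
        print(ascii(word), "chars:", chars, "utf-8 bytes:", nbytes)
```

Running the same function on the bytes of the real file on disk (read with `open(path, "rb")`) would show whether that file parses as cleanly as the pasted example.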
Thank you for the detailed bug report. This should be enough to pinpoint the cause of the bug and hopefully find a fix. I'll report back once I've gotten around to investigating it. That might be a while, as I'm currently on parental leave, i.e. it will happen when the little one has had a good night and I'm not too swamped with household stuff (-:
So I just tried to reproduce the issue with the example document from the OP, but it indexes just fine for me 🤔
Can you share the file that causes the issue, i.e. the actual c:/OCR/MNB_006_045/2ff558170f3aea11a96000155d02ad02.hocr file on disk?
Also, could you try running the same setup with the same data inside a Docker container with a Linux system? The plugin was only tested on Linux and uses a few low-level interfaces that might behave differently on Windows systems, so it would be good to verify whether this is the case.
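One common kind of platform difference, given here purely as an illustration and not as a diagnosis of this bug, is the default text encoding: a Windows system may default to a single-byte code page such as cp1252, while the hOCR files are UTF-8. The Python sketch below shows what happens to the two words from the report when their UTF-8 bytes are decoded with the wrong encoding, and how to read a file with an explicit encoding instead. The helper names `misdecode` and `read_text_utf8` are hypothetical, and nothing here reflects how the plugin itself reads files.

```python
import locale


def misdecode(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode the bytes with another encoding.

    This mimics reading a UTF-8 file with the wrong default encoding.
    Note that cp1252 leaves a few byte values undefined, so other inputs
    may raise UnicodeDecodeError instead of producing garbled text.
    """
    return text.encode("utf-8").decode(wrong_encoding)


def read_text_utf8(path):
    """Read a file as UTF-8 regardless of the platform default encoding."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    # The default differs between systems, which is the point of the check.
    print("platform default encoding:", locale.getpreferredencoding(False))
    for word in ("ar\u00e1me", "k\u016f"):
        garbled = misdecode(word)
        # ascii() keeps the output printable on any console.
        print(ascii(word), len(word), "->", ascii(garbled), len(garbled))
```

Each accented character turns into two characters when misread, so character counts and byte counts no longer agree. Comparing the output on the Windows host and inside the Linux container would show whether the two environments differ in their default encoding.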
Thanks for the quick reply
My results are:
| OS | Solr version | Plugin version | Status |
| --- | --- | --- | --- |
| Windows 10 | 8.2 | 0.6.0 | NOK |
| Windows 10 | 8.2 | 0.5.0 | OK |
| Windows 10 (WSL2 Ubuntu + Docker -> Linux) | 8.7 | 0.6.0 | OK |
I think you're right. The problem seems to be specific to Windows (with version 0.6.0).
Thank you! I'll try setting up a Windows environment to reproduce and hopefully fix the issue. It might take a bit, though 😬