Indexing of ALTO fails with unknown source
Closed this issue · 2 comments
albig commented
I cannot index the following ALTO file:
Solr 8.8.8 logs this error:
...
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Failed to parse the OCR markup, make sure your files are well-formed and your regions start/end on complete tags! (Source was: [unknown])
    at de.digitalcollections.solrocr.formats.OcrParser.<init>(OcrParser.java:109)
    at de.digitalcollections.solrocr.formats.alto.AltoParser.<init>(AltoParser.java:27)
    at de.digitalcollections.solrocr.formats.alto.AltoFormat.getParser(AltoFormat.java:38)
    at de.digitalcollections.solrocr.model.OcrFormat.filter(OcrFormat.java:90)
    at de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory.create(OcrCharFilterFactory.java:51)
    ...
Caused by: com.ctc.wstx.exc.WstxException: Reader (of type com.ctc.wstx.io.MergedReader) returned 0 characters, even when asked to read up to 4000
 at [row,col {unknown-source}]: [1,1]
    at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:98)
    at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:56)
    at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1001)
    at com.ctc.wstx.sr.StreamScanner.getNext(StreamScanner.java:762)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2713)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1073)
    at de.digitalcollections.solrocr.formats.alto.AltoParser.seekToNextWord(AltoParser.java:229)
    at de.digitalcollections.solrocr.formats.alto.AltoParser.readNext(AltoParser.java:49)
    at de.digitalcollections.solrocr.formats.OcrParser.<init>(OcrParser.java:106)
Unfortunately, I have no idea what is wrong with this file, or whether this is a bug in solr-ocrhighlighting.
Do you have any advice?
albig commented
The reason for this behaviour seems to be the huge <Polygon> element. It is about 6000 characters long, but the plugin reads only up to 4000 at a time. I'm not sure where to change the code to raise this value.
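For anyone who wants to check whether their own files contain similarly oversized tags, here is a minimal Python sketch. It only tests the hypothesis above; it does not confirm that tag length is really what trips the parser. The function name `long_tags` and the default `limit` of 4000 (taken from the number in the log message) are my own assumptions, not anything from the plugin.

```python
import re

# Matches a start or empty-element tag and captures the element name.
# Assumption: attribute values contain no literal '>' (common for ALTO output).
TAG_RE = re.compile(r"<([A-Za-z_][\w:.-]*)[^>]*>")


def long_tags(xml_text, limit=4000):
    """Return (element name, tag length, offset) for each tag longer than `limit`."""
    hits = []
    for match in TAG_RE.finditer(xml_text):
        length = match.end() - match.start()
        if length > limit:
            hits.append((match.group(1), length, match.start()))
    return hits


if __name__ == "__main__":
    # Synthetic stand-in for a page with a very detailed polygon outline.
    points = " ".join("100,200" for _ in range(800))
    sample = f'<alto><Shape><Polygon POINTS="{points}"/></Shape></alto>'
    for name, length, offset in long_tags(sample):
        print(f"<{name}> tag at offset {offset} is {length} characters long")
```

Running it over the real file (read as text) would show whether <Polygon> is the only element above the threshold.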
For my use case, I now use a different approach: I convert my ALTO files into the proposed MiniOCR format. In that case, the <Polygon> element is dropped anyway.
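A rough sketch of such a conversion in Python, in case it helps someone else. This is not my actual converter and not part of the plugin. It assumes the MiniOCR layout as I understand it from the plugin documentation (`<p>` with `xml:id` and `wh`, `<b>` for blocks, `<l>` for lines, `<w x="x y w h">` for words), and it ignores hyphenation and alternatives. The helper names are made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def local(tag):
    """Strip the namespace so the sketch works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def descendants(parent, name):
    """Yield all descendants of `parent` whose local tag name is `name`."""
    return (e for e in parent.iter() if local(e.tag) == name)


def px(element, attr):
    """Read a coordinate attribute and round it to a whole number."""
    return round(float(element.get(attr, "0")))


def alto_to_miniocr(alto_xml):
    """Convert an ALTO document (str or bytes) to a MiniOCR string.

    Only Page, TextBlock, TextLine and String are read, so <Shape> and
    <Polygon> never reach the output.
    """
    root = ET.fromstring(alto_xml)
    out = ["<ocr>"]
    for page in descendants(root, "Page"):
        page_id = quoteattr(page.get("ID", ""))
        out.append(f'<p xml:id={page_id} wh="{px(page, "WIDTH")} {px(page, "HEIGHT")}">')
        for block in descendants(page, "TextBlock"):
            out.append("<b>")
            for line in descendants(block, "TextLine"):
                words = []
                for s in descendants(line, "String"):
                    box = " ".join(str(px(s, a)) for a in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
                    words.append(f'<w x="{box}">{escape(s.get("CONTENT", ""))}</w>')
                out.append("<l>" + " ".join(words) + "</l>")
            out.append("</b>")
        out.append("</p>")
    out.append("</ocr>")
    return "\n".join(out)
```

The whole file is parsed in memory with the standard library, which was fine for single pages in my tests but may not suit very large multi-page files.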