dbmdz/solr-ocrhighlighting

Indexing of ALTO fails with unknown source

Closed this issue · 2 comments

albig commented

I cannot index the following ALTO file:

https://digital.slub-dresden.de/data/kitodo/lebe_364572701-19320800/lebe_364572701-19320800_ocr/00000068.xml

Solr 8.8.8 logs this error:

...
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Failed to parse the OCR markup, make sure your files are well-formed and your regions start/end on complete tags! (Source was: [unknown])
	at de.digitalcollections.solrocr.formats.OcrParser.<init>(OcrParser.java:109)
	at de.digitalcollections.solrocr.formats.alto.AltoParser.<init>(AltoParser.java:27)
	at de.digitalcollections.solrocr.formats.alto.AltoFormat.getParser(AltoFormat.java:38)
	at de.digitalcollections.solrocr.model.OcrFormat.filter(OcrFormat.java:90)
	at de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory.create(OcrCharFilterFactory.java:51)
...
Caused by: com.ctc.wstx.exc.WstxException: Reader (of type com.ctc.wstx.io.MergedReader) returned 0 characters, even when asked to read up to 4000
 at [row,col {unknown-source}]: [1,1]
	at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:98)
	at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:56)
	at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1001)
	at com.ctc.wstx.sr.StreamScanner.getNext(StreamScanner.java:762)
	at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2713)
	at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1073)
	at de.digitalcollections.solrocr.formats.alto.AltoParser.seekToNextWord(AltoParser.java:229)
	at de.digitalcollections.solrocr.formats.alto.AltoParser.readNext(AltoParser.java:49)
	at de.digitalcollections.solrocr.formats.OcrParser.<init>(OcrParser.java:106)

Unfortunately, I have no idea what is wrong with this file or whether this is a bug in solr-ocrhighlighting.

Do you have any advice?

albig commented

The reason for this behaviour seems to be the huge <Polygon> element. It is about 6000 characters long, but the plugin reads only up to 4000 at a time. I'm not sure where to change the code to raise this value.
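
A minimal sketch for checking this hypothesis on other files: it lists every tag whose raw text is longer than a given limit. The default of 4000 is taken from the error message above; the function name `oversized_tags` is made up for this example, and the regex is only an approximation of XML tag boundaries (it does not handle a literal `>` inside attribute values, comments, or CDATA).

```python
import re

# Approximate match for a single tag: "<", then anything except "<" or ">", then ">".
TAG_RE = re.compile(r"<[^<>]+>")


def oversized_tags(xml_text, limit=4000):
    """Return (tag_name, length) for every tag longer than `limit` characters."""
    hits = []
    for match in TAG_RE.finditer(xml_text):
        tag = match.group(0)
        if len(tag) > limit:
            # Drop the leading "<" or "</", keep the first token, drop a trailing "/>" or ">".
            name = tag.lstrip("</").split(None, 1)[0].rstrip("/>")
            hits.append((name, len(tag)))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_oversized_tags.py 00000068.xml
    with open(sys.argv[1], encoding="utf-8") as handle:
        for name, length in oversized_tags(handle.read()):
            print(f"<{name}> is {length} characters long")
```

If the script reports a <Polygon> tag well above the limit, the file is probably affected by the same problem.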

For my use case I now use another approach: I convert my ALTO files into the proposed MiniOCR format. In this case, the <Polygon> element is dropped anyway.
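
For anyone who wants to keep ALTO, an alternative (not the conversion described above) would be to strip the outline data before indexing. The following is a rough sketch, assuming the <Polygon> elements sit inside <Shape> wrappers as in the ALTO schema; `strip_shapes` is a hypothetical helper, and it has not been tested against the plugin.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the element name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def strip_shapes(alto_xml):
    """Return the ALTO document with all <Shape> elements (and their polygons) removed."""
    root = ET.fromstring(alto_xml)

    # Keep the document's default namespace on output instead of generated "ns0:" prefixes.
    if root.tag.startswith("{"):
        ET.register_namespace("", root.tag[1:].split("}", 1)[0])

    # Take a snapshot of all elements first so that removing children
    # does not interfere with the tree traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) == "Shape":
                parent.remove(child)

    return ET.tostring(root, encoding="unicode")
```

Note that ElementTree re-serializes the whole file, so whitespace and attribute quoting may differ from the input, and the XML declaration is not written.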

Thanks for reporting. This is the same bug as reported in #212; a fix is in the pipeline.