dbmdz/solr-ocrhighlighting

IndexOutOfBoundsException in OcrAlternativesFilterFactory

fd17 opened this issue · 3 comments

fd17 commented

I have encountered a rather weird, extremely hard-to-reproduce StringIndexOutOfBoundsException when indexing MiniOcr files with alternatives. It gets thrown at line 168 of OcrAlternativesFilterFactory.java. This is the stack trace:

org.apache.solr.common.SolrException: Exception writing document
[...]
Caused by: java.lang.StringIndexOutOfBoundsException: offset 254, count -255, length 288
    at java.base/java.lang.String.checkBoundsOffCount(Unknown Source)
    at java.base/java.lang.String.rangeCheck(Unknown Source)
    at java.base/java.lang.String.<init>(Unknown Source)
    at de.digitalcollections.solrocr.lucene.OcrAlternativesFilterFactory$OcrAlternativesFilter.incrementToken(OcrAlternativesFilterFactory.java:168)
[...]

It is hard to reproduce because it has a random component. Certain MiniOcr files sometimes trigger the exception, but not always. I have seen pages that never fail, pages that fail roughly 50% of the time and pages that fail every time they are indexed. It does not seem to depend on file size or special character encodings and the exception-causing negative "count" value seems to be the same every time.

Adding an additional check before line 168 to ensure that (closingIdx - curPos) is >= 0 seemingly prevents the exception and still indexes alternatives correctly, but I'm not really sure if this is the right solution. Any insight on this part of the code would be appreciated.
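For illustration, the kind of check I mean looks roughly like the sketch below. It is not the plugin's actual code: the class, the `|` marker and the `readAlternatives` helper are all made up for this example, and only the `closingIdx - curPos >= 0` condition corresponds to what I added locally.

```java
import java.util.ArrayList;
import java.util.List;

final class AlternativeSliceSketch {
    // Hypothetical delimiter that closes each alternative; the plugin's real marker may differ.
    static final char MARKER = '|';

    /** Linear search in the valid part [from, length) of the buffer; returns -1 if absent. */
    static int indexOf(char[] buf, int length, char c, int from) {
        for (int i = from; i < length; i++) {
            if (buf[i] == c) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Splits the first `length` chars of `buf` into marker-terminated pieces.
     * A trailing piece without a closing marker is dropped instead of being
     * sliced with a negative count.
     */
    static List<String> readAlternatives(char[] buf, int length) {
        List<String> out = new ArrayList<>();
        int curPos = 0;
        while (curPos < length) {
            int closingIdx = indexOf(buf, length, MARKER, curPos);
            if (closingIdx - curPos < 0) {
                break; // the guard: no closing marker found, so stop here
            }
            out.add(new String(buf, curPos, closingIdx - curPos));
            curPos = closingIdx + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        char[] buf = new char[16]; // backing array is larger than the valid content
        "foo|bar|ba".getChars(0, 10, buf, 0);
        System.out.println(readAlternatives(buf, 10));
    }
}
```

Without the guard, the last iteration would call `new String(buf, 8, -9)` and throw. With it, the incomplete trailing piece is silently skipped, which is why I'm unsure whether this is a real fix or only hides the symptom.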

fd17 commented

Apparently, setting the maxTokenLength parameter of Solr's StandardTokenizerFactory to a less conservative (i.e. larger) value fixes the issue.

Thank you for updating the issue with your findings. I'll try to find out what's happening; this shouldn't happen randomly like that.

fd17 commented

My best guess is that the randomness comes from the tokenizer's internals. It creates a char buffer for the token, which Solr limits to 255 characters by default. The buffer also seems to vary slightly on each execution. This isn't a problem in most cases, since the extra buffer content is never seen by the user. But since it is not directly correlated with the number of alternatives, or even the number of chars between the w tags, it is rather hard to debug. If you have an unlucky input combination that ends up close to the max buffer length, it can sometimes trigger the string exception, since buffer.length() returns at most 255, which may cut off the input string too early. This might also lead to a buffer overflow or memory corruption somewhere, which could explain some of the randomness as well, but I haven't checked that.

This is not a bug in the plugin, but an internal problem of Solr. The developers probably assumed that tokens wouldn't be longer than 255 characters, which is reasonable given that even very long compound German words only have about 40 characters at most.
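One possible reading of the numbers in the trace, which is an inference and not something verified against the plugin source: if the closing marker of an alternative falls beyond the 255-character cut-off, a search for it returns -1, and with the current position at 254 the slice length becomes -1 - 254 = -255. The sketch below simulates that under stated assumptions. The `|` marker, the 288-char backing array (taken from the "length 288" in the trace) and all class and method names are hypothetical.

```java
import java.util.Arrays;

final class TruncationSketch {
    static final int MAX_TOKEN_LENGTH = 255; // default limit mentioned above
    static final char MARKER = '|';          // hypothetical closing marker

    /** Builds a 258-char token: 253 filler chars, a marker, then "alt" and its closing marker. */
    static String exampleToken() {
        char[] form = new char[253];
        Arrays.fill(form, 'x');
        return new String(form) + MARKER + "alt" + MARKER;
    }

    /** Copies at most `limit` chars into a backing array of the given capacity. */
    static char[] truncatedBuffer(String token, int limit, int capacity) {
        char[] buf = new char[capacity];
        token.getChars(0, Math.min(token.length(), limit), buf, 0);
        return buf;
    }

    /** Linear search in the valid part [from, length) of the buffer; returns -1 if absent. */
    static int indexOf(char[] buf, int length, char c, int from) {
        for (int i = from; i < length; i++) {
            if (buf[i] == c) {
                return i;
            }
        }
        return -1;
    }

    /** The count an unguarded slice would pass to new String(buf, curPos, count). */
    static int sliceCount(char[] buf, int length, int curPos) {
        return indexOf(buf, length, MARKER, curPos) - curPos;
    }

    public static void main(String[] args) {
        String token = exampleToken();
        // Cut off at 255 chars: the closing marker at index 257 is lost.
        char[] cut = truncatedBuffer(token, MAX_TOKEN_LENGTH, 288);
        System.out.println(sliceCount(cut, MAX_TOKEN_LENGTH, 254));
        // With a larger limit the whole token fits and the slice is valid.
        char[] full = truncatedBuffer(token, 1024, 1024);
        System.out.println(sliceCount(full, token.length(), 254));
    }
}
```

Under these assumptions, the truncated case gives offset 254 and count -255 against a 288-char array, and raising the limit makes the same token slice cleanly. That would be consistent both with the trace and with the larger max token length making the exception go away.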