Bug with forward index storage (signed/unsigned problem?)
The integrated forward index looks at the number of unique terms, and uses only as many bytes as needed to store each term index in a document (the external forward index always used 4 bytes per term in a document).
It seems like there's a bug with this, specifically when 3 bytes are used. The stacktrace below has an ArrayIndexOutOfBoundsException where the incorrect index is 16777215, which is 2^24-1: the highest value an unsigned 3-byte integer can take, and the same bit pattern (0xFFFFFF) that represents -1 in a signed 3-byte integer.
There's probably a problem with the (de)serialization of a Java (4-byte) integer to a 3-byte value.
Caused by: java.util.concurrent.ExecutionException: nl.inl.blacklab.exceptions.BlackLabRuntimeException: java.lang.ArrayIndexOutOfBoundsException: Index 16777215 out of bounds for length 771577
at nl.inl.blacklab.server.search.BlsCacheEntry.get(BlsCacheEntry.java:274)
at nl.inl.blacklab.server.search.BlsCacheEntry.get(BlsCacheEntry.java:254)
at nl.inl.blacklab.server.search.BlsCacheEntry.get(BlsCacheEntry.java:23)
at nl.inl.blacklab.search.results.ResultsStatsDelegate.stats(ResultsStatsDelegate.java:47)
... 51 more
Caused by: nl.inl.blacklab.exceptions.BlackLabRuntimeException: java.lang.ArrayIndexOutOfBoundsException: Index 16777215 out of bounds for length 771577
at nl.inl.blacklab.search.results.HitsFromQuery.ensureResultsRead(HitsFromQuery.java:219)
at nl.inl.blacklab.search.results.ResultsAbstract.ensureAllResultsRead(ResultsAbstract.java:259)
at nl.inl.blacklab.search.results.HitsFromQuery.hitsProcessedTotal(HitsFromQuery.java:285)
at nl.inl.blacklab.search.results.HitsFromQuery.resultsProcessedTotal(HitsFromQuery.java:306)
at nl.inl.blacklab.search.results.ResultsAbstract$1.processedTotal(ResultsAbstract.java:105)
at nl.inl.blacklab.search.results.ResultCount.processedTotal(ResultCount.java:82)
at nl.inl.blacklab.searches.SearchCountFromResults.executeInternal(SearchCountFromResults.java:46)
at nl.inl.blacklab.searches.SearchCountFromResults.executeInternal(SearchCountFromResults.java:17)
at nl.inl.blacklab.server.search.BlsCacheEntry.executeSearch(BlsCacheEntry.java:137)
at nl.inl.blacklab.server.search.BlsCacheEntry.lambda$start$0(BlsCacheEntry.java:124)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
... 1 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 16777215 out of bounds for length 771577
at nl.inl.blacklab.forwardindex.TermsIntegrated.segmentIdsToGlobalIds(TermsIntegrated.java:227)
at nl.inl.blacklab.codec.BLTerms.termsEqual(BLTerms.java:117)
at nl.inl.blacklab.search.fimatch.ForwardIndexAccessorIntegrated$ForwardIndexAccessorLeafReaderIntegrated.segmentTermsEqual(ForwardIndexAccessorIntegrated.java:105)
at nl.inl.blacklab.search.fimatch.ForwardIndexDocumentImpl.segmentTermsEqual(ForwardIndexDocumentImpl.java:153)
at nl.inl.blacklab.search.matchfilter.MatchFilterSameTokens.evaluate(MatchFilterSameTokens.java:96)
at nl.inl.blacklab.search.lucene.SpansConstrained.ensureValidHit(SpansConstrained.java:149)
at nl.inl.blacklab.search.lucene.SpansConstrained.nextStartPosition(SpansConstrained.java:118)
at nl.inl.blacklab.search.results.SpansReader.advanceSpansToNextHit(SpansReader.java:191)
at nl.inl.blacklab.search.results.SpansReader.run(SpansReader.java:285)
at java.base/java.util.ArrayList.forEach(Unknown Source)
at nl.inl.blacklab.search.results.HitsFromQuery.lambda$ensureResultsRead$3(HitsFromQuery.java:193)
... 5 more
Confirmed. When storing three-byte values, we ignore the most-significant byte of the original integer. This is fine when the value is between 0 and 2^24-1 (the most-significant byte is 0x00 in that case), but when the value is -1 (which apparently happens sometimes - not 100% sure why), the dropped byte is 0xFF and carries the sign bit, so only three 0xFF bytes are stored. These are then decoded as an unsigned value: (0xFF << 16) + (0xFF << 8) + 0xFF, or 2^24-1 = 16777215.
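A minimal sketch of the round trip described above, to make the failure easy to reproduce in isolation. The class and method names (ThreeByteBugDemo, encode3, decode3Unsigned) and the big-endian byte order are assumptions made for illustration; this is not the actual BlackLab code.

```java
public class ThreeByteBugDemo {

    /** Hypothetical encoder: keeps the three low-order bytes, drops the most-significant one. */
    public static byte[] encode3(int value) {
        return new byte[] {
            (byte) (value >> 16),
            (byte) (value >> 8),
            (byte) value
        };
    }

    /** Hypothetical decoder: treats all three bytes as unsigned, so the sign is lost. */
    public static int decode3Unsigned(byte[] b) {
        return ((b[0] & 0xFF) << 16) + ((b[1] & 0xFF) << 8) + (b[2] & 0xFF);
    }

    public static void main(String[] args) {
        // A non-negative value below 2^24 survives the round trip.
        System.out.println(decode3Unsigned(encode3(771576))); // prints 771576
        // -1 is stored as FF FF FF and comes back as 2^24-1.
        System.out.println(decode3Unsigned(encode3(-1)));     // prints 16777215
    }
}
```

The second result matches the out-of-bounds index in the stack trace.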
The best solution is probably to treat a three-byte integer as a signed value, just like short and int. It would then be able to store the range [-2^23, 2^23-1] (both inclusive).
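One way the signed interpretation could look, as a sketch only: the first byte is left unmasked when decoding, so Java's byte-to-int promotion sign-extends it. The class and method names (SignedThreeByteInt, encode, decode), the big-endian byte order, and the range check are assumptions for illustration, not the fix that was actually committed.

```java
public class SignedThreeByteInt {

    public static final int MIN_VALUE = -(1 << 23);    // -2^23
    public static final int MAX_VALUE = (1 << 23) - 1; //  2^23-1

    /** Stores the three low-order bytes; rejects values that do not fit in 24 signed bits. */
    public static byte[] encode(int value) {
        if (value < MIN_VALUE || value > MAX_VALUE) {
            throw new IllegalArgumentException("Value does not fit in 3 signed bytes: " + value);
        }
        return new byte[] {
            (byte) (value >> 16),
            (byte) (value >> 8),
            (byte) value
        };
    }

    /** Decodes three bytes as a signed value; b[0] is not masked, so it is sign-extended. */
    public static int decode(byte[] b) {
        return (b[0] << 16) | ((b[1] & 0xFF) << 8) | (b[2] & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(-1)));        // prints -1
        System.out.println(decode(encode(MAX_VALUE))); // prints 8388607
        System.out.println(decode(encode(MIN_VALUE))); // prints -8388608
    }
}
```

Note the trade-off: the largest storable term index drops from 2^24-1 to 2^23-1, so any logic that picks the byte width from the number of unique terms would need to switch to 4 bytes one bit earlier.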