INL/BlackLab

Bug with forward index storage (signed/unsigned problem?)

Closed this issue · 1 comments

The integrated forward index looks at the number of unique terms, and uses only as many bytes as needed to store each term index in a document (the external forward index always used 4 bytes per term in a document).

It seems like there's a bug with this, specifically when 3 bytes are used. The stack trace below shows an ArrayIndexOutOfBoundsException where the invalid index is 16777215, i.e. 2^24-1. That is the highest value an unsigned 3-byte integer can hold, and it is also the bit pattern (0xFFFFFF) that represents -1 in a signed 3-byte integer.

There's probably a problem with the (de)serialization of a Java (4-byte) integer to a 3-byte value.

Caused by: java.util.concurrent.ExecutionException: nl.inl.blacklab.exceptions.BlackLabRuntimeException: java.lang.ArrayIndexOutOfBoundsException: Index 16777215 out of bounds for length 771577
	at nl.inl.blacklab.server.search.BlsCacheEntry.get(BlsCacheEntry.java:274)
	at nl.inl.blacklab.server.search.BlsCacheEntry.get(BlsCacheEntry.java:254)
	at nl.inl.blacklab.server.search.BlsCacheEntry.get(BlsCacheEntry.java:23)
	at nl.inl.blacklab.search.results.ResultsStatsDelegate.stats(ResultsStatsDelegate.java:47)
	... 51 more
Caused by: nl.inl.blacklab.exceptions.BlackLabRuntimeException: java.lang.ArrayIndexOutOfBoundsException: Index 16777215 out of bounds for length 771577
	at nl.inl.blacklab.search.results.HitsFromQuery.ensureResultsRead(HitsFromQuery.java:219)
	at nl.inl.blacklab.search.results.ResultsAbstract.ensureAllResultsRead(ResultsAbstract.java:259)
	at nl.inl.blacklab.search.results.HitsFromQuery.hitsProcessedTotal(HitsFromQuery.java:285)
	at nl.inl.blacklab.search.results.HitsFromQuery.resultsProcessedTotal(HitsFromQuery.java:306)
	at nl.inl.blacklab.search.results.ResultsAbstract$1.processedTotal(ResultsAbstract.java:105)
	at nl.inl.blacklab.search.results.ResultCount.processedTotal(ResultCount.java:82)
	at nl.inl.blacklab.searches.SearchCountFromResults.executeInternal(SearchCountFromResults.java:46)
	at nl.inl.blacklab.searches.SearchCountFromResults.executeInternal(SearchCountFromResults.java:17)
	at nl.inl.blacklab.server.search.BlsCacheEntry.executeSearch(BlsCacheEntry.java:137)
	at nl.inl.blacklab.server.search.BlsCacheEntry.lambda$start$0(BlsCacheEntry.java:124)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	... 1 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 16777215 out of bounds for length 771577
	at nl.inl.blacklab.forwardindex.TermsIntegrated.segmentIdsToGlobalIds(TermsIntegrated.java:227)
	at nl.inl.blacklab.codec.BLTerms.termsEqual(BLTerms.java:117)
	at nl.inl.blacklab.search.fimatch.ForwardIndexAccessorIntegrated$ForwardIndexAccessorLeafReaderIntegrated.segmentTermsEqual(ForwardIndexAccessorIntegrated.java:105)
	at nl.inl.blacklab.search.fimatch.ForwardIndexDocumentImpl.segmentTermsEqual(ForwardIndexDocumentImpl.java:153)
	at nl.inl.blacklab.search.matchfilter.MatchFilterSameTokens.evaluate(MatchFilterSameTokens.java:96)
	at nl.inl.blacklab.search.lucene.SpansConstrained.ensureValidHit(SpansConstrained.java:149)
	at nl.inl.blacklab.search.lucene.SpansConstrained.nextStartPosition(SpansConstrained.java:118)
	at nl.inl.blacklab.search.results.SpansReader.advanceSpansToNextHit(SpansReader.java:191)
	at nl.inl.blacklab.search.results.SpansReader.run(SpansReader.java:285)
	at java.base/java.util.ArrayList.forEach(Unknown Source)
	at nl.inl.blacklab.search.results.HitsFromQuery.lambda$ensureResultsRead$3(HitsFromQuery.java:193)
	... 5 more

Confirmed. When storing three-byte values, we drop the most-significant byte of the original integer. This is fine when the value is between 0 and 2^24-1 (the most-significant byte is 0x00 in that case), but when the value is -1 (which apparently happens sometimes - not 100% sure why), the sign information is lost and three 0xFF bytes are stored. On reading, these are decoded as an unsigned value: (0xFF << 16) + (0xFF << 8) + 0xFF, or 2^24-1.
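To make the round trip concrete, here is a minimal sketch of the behaviour described above. The class and method names are hypothetical and this is not the actual BlackLab code; it only assumes big-endian storage of the three low-order bytes and an unsigned decode.

```java
// Hypothetical illustration, not BlackLab's implementation.
final class ThreeByteBugDemo {

    // Keep only the three low-order bytes of the int (big-endian).
    // The most-significant byte, which carries the sign, is dropped.
    static byte[] encode3(int value) {
        return new byte[] {
            (byte) (value >> 16),
            (byte) (value >> 8),
            (byte) value
        };
    }

    // Decode as an unsigned 24-bit value: every byte is masked with 0xFF,
    // so the result is always in the range [0, 2^24-1].
    static int decode3Unsigned(byte[] b) {
        return ((b[0] & 0xFF) << 16) + ((b[1] & 0xFF) << 8) + (b[2] & 0xFF);
    }

    public static void main(String[] args) {
        // -1 is stored as FF FF FF and read back as 2^24-1.
        System.out.println(decode3Unsigned(encode3(-1))); // prints 16777215
    }
}
```

Non-negative values below 2^24 survive this round trip unchanged, which is presumably why the problem only shows up when a -1 is written.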

The best solution is probably to treat a three-byte integer as a signed value, just like short and int. It would then be able to store the range [-2^23, 2^23-1] (both inclusive).
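One way the signed interpretation could look is sketched below. All names are hypothetical and this is not the fix as committed; the sketch assumes big-endian byte order and relies on Java sign-extending a byte when it is widened to int.

```java
// Hypothetical sketch of a signed 3-byte encoding, not BlackLab's actual code.
final class SignedThreeByteSketch {

    static final int MIN_VALUE = -(1 << 23);     // -8388608
    static final int MAX_VALUE = (1 << 23) - 1;  //  8388607

    // Store the three low-order bytes (big-endian), rejecting values that
    // do not fit in a signed 24-bit integer.
    static byte[] encode(int value) {
        if (value < MIN_VALUE || value > MAX_VALUE) {
            throw new IllegalArgumentException("Does not fit in 3 signed bytes: " + value);
        }
        return new byte[] {
            (byte) (value >> 16),
            (byte) (value >> 8),
            (byte) value
        };
    }

    // The high byte is deliberately NOT masked with 0xFF: widening it to int
    // sign-extends it, so FF FF FF decodes to -1 instead of 2^24-1.
    static int decode(byte[] b) {
        return (b[0] << 16) | ((b[1] & 0xFF) << 8) | (b[2] & 0xFF);
    }
}
```

A consequence of this design is that term indexes from 2^23 up to 2^24-1 no longer fit in three bytes, so any code that chooses the byte width from the number of unique terms would have to switch to four bytes at 2^23 rather than 2^24.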