Index compatibility issue between Lucene 8 and Lucene 9
lintool opened this issue · 1 comments
I encountered an issue with Lucene 9 reading indexes built by Lucene 8.
The exception is something along the lines of:
java.lang.IllegalStateException: unexpected docvalues type SORTED for field 'id' (expected=BINARY). Re-index with correct docvalues type.
The crux of the issue is the following:
In DefaultLuceneDocumentGenerator, we add the (external) docid as a DocValue:
// Store the collection docid.
document.add(new StringField(IndexArgs.ID, id, Field.Store.YES));
// This is needed to break score ties by docid.
document.add(new BinaryDocValuesField(IndexArgs.ID, new BytesRef(id)));
So that we can break ties by the docid, in SearchCollection we have a Sort:
public static final Sort BREAK_SCORE_TIES_BY_DOCID =
new Sort(SortField.FIELD_SCORE, new SortField(IndexArgs.ID, SortField.Type.STRING_VAL));
The reason we do this is to ensure consistent tie breaking, as outlined in this SIGIR 2019 paper.
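To make the tie-breaking rule concrete, here is a minimal plain-Java sketch of the ordering that Sort is meant to produce: score descending, then docid ascending. It does not use Lucene at all. The Hit class, the rank helper, and the sample docids are hypothetical, and String.compareTo stands in for Lucene's comparison of the stored term values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TieBreakSketch {
    // Hypothetical stand-in for a search hit: external docid plus score.
    static final class Hit {
        final String docid;
        final float score;

        Hit(String docid, float score) {
            this.docid = docid;
            this.score = score;
        }
    }

    // Higher score first; equal scores fall back to docid order.
    static final Comparator<Hit> BREAK_SCORE_TIES_BY_DOCID = (a, b) -> {
        int byScore = Float.compare(b.score, a.score);
        return byScore != 0 ? byScore : a.docid.compareTo(b.docid);
    };

    // Returns the docids in ranked order without mutating the input list.
    static List<String> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(BREAK_SCORE_TIES_BY_DOCID);
        List<String> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.docid);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit("doc3", 1.5f),
            new Hit("doc1", 1.5f),
            new Hit("doc2", 2.0f));
        System.out.println(rank(hits)); // prints [doc2, doc1, doc3]
    }
}
```

Because the docid is unique per document, the ranking no longer depends on the order in which hits were collected, which is the point of the secondary sort key.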
@tteofili indicated that this was a Lucene 8/Lucene 9 breaking change, due to this issue: fix SortedDocValues to no longer extend BinaryDocValues.
Reindexing with Lucene 9 fixes this issue.
Related, interesting tidbit:
From the SortField.Type.STRING_VAL javadoc: "Sort using term values as Strings, but comparing by value (using String.compareTo) for all comparisons. This is typically slower than STRING, which uses ordinals to do the sorting."
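As a rough illustration of the difference the javadoc describes, the plain-Java sketch below sorts the same values twice: once by comparing the strings directly, and once by first mapping each value to its ordinal (its position in a sorted dictionary of distinct values) and then comparing integers. This is only a conceptual model, not how Lucene implements either sort type. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.TreeSet;

final class OrdinalSortSketch {
    // Maps each value to its position in the sorted set of distinct values.
    static int[] toOrdinals(String[] values) {
        String[] dict = new TreeSet<>(Arrays.asList(values)).toArray(new String[0]);
        int[] ords = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            ords[i] = Arrays.binarySearch(dict, values[i]);
        }
        return ords;
    }

    private static Integer[] identity(int n) {
        Integer[] docs = new Integer[n];
        for (int i = 0; i < n; i++) {
            docs[i] = i;
        }
        return docs;
    }

    // Sorts document indices by comparing precomputed integer ordinals.
    static Integer[] sortByOrdinal(String[] values) {
        int[] ords = toOrdinals(values);
        Integer[] docs = identity(values.length);
        Arrays.sort(docs, (a, b) -> Integer.compare(ords[a], ords[b]));
        return docs;
    }

    // Sorts document indices by comparing the string values on every comparison.
    static Integer[] sortByValue(String[] values) {
        Integer[] docs = identity(values.length);
        Arrays.sort(docs, (a, b) -> values[a].compareTo(values[b]));
        return docs;
    }

    public static void main(String[] args) {
        String[] ids = {"d3", "d1", "d2", "d1"};
        System.out.println(Arrays.toString(sortByOrdinal(ids))); // prints [1, 3, 2, 0]
        System.out.println(Arrays.toString(sortByValue(ids)));   // prints [1, 3, 2, 0]
    }
}
```

Both approaches yield the same order. The ordinal version pays the dictionary cost once and then compares integers, which is the intuition behind STRING typically being faster than STRING_VAL.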