castorini/anserini

Index compatibility issue between Lucene 8 and Lucene 9

lintool opened this issue · 1 comment

I encountered an issue with Lucene 9 reading indexes built by Lucene 8.
The exception is something along the lines of:

java.lang.IllegalStateException: unexpected docvalues type SORTED for field 'id' (expected=BINARY). Re-index with correct docvalues type.

The crux of the issue is the following:

In DefaultLuceneDocumentGenerator, we add the (external) docid as a DocValue:

    // Store the collection docid.
    document.add(new StringField(IndexArgs.ID, id, Field.Store.YES));
    // This is needed to break score ties by docid.
    document.add(new BinaryDocValuesField(IndexArgs.ID, new BytesRef(id)));

So that we can break ties by the docid, in SearchCollection we have a Sort:

    public static final Sort BREAK_SCORE_TIES_BY_DOCID =
        new Sort(SortField.FIELD_SCORE, new SortField(IndexArgs.ID, SortField.Type.STRING_VAL));

The reason we do this is to ensure consistent tie breaking, as outlined in this SIGIR 2019 paper.
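To make the tie-breaking rule concrete, here is a minimal plain-Java sketch of the ordering that the Sort above asks for: score descending, then docid ascending. It does not use Lucene; the `TieBreakSketch` and `Hit` names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TieBreakSketch {
  /** Hypothetical stand-in for a search hit: external docid plus score. */
  static final class Hit {
    final String docid;
    final float score;

    Hit(String docid, float score) {
      this.docid = docid;
      this.score = score;
    }
  }

  /** Higher score first; equal scores are ordered by docid (String.compareTo). */
  static final Comparator<Hit> SCORE_THEN_DOCID = (a, b) -> {
    int byScore = Float.compare(b.score, a.score);
    return byScore != 0 ? byScore : a.docid.compareTo(b.docid);
  };

  /** Returns the docids in ranked order without modifying the input list. */
  static List<String> rank(List<Hit> hits) {
    List<Hit> sorted = new ArrayList<>(hits);
    sorted.sort(SCORE_THEN_DOCID);
    List<String> docids = new ArrayList<>();
    for (Hit h : sorted) {
      docids.add(h.docid);
    }
    return docids;
  }
}
```

With this comparator, two hits with the same score come out in the same order regardless of the order in which they were collected, which is the property we want for reproducible runs.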

@tteofili indicated that this is a breaking change between Lucene 8 and Lucene 9, caused by this Lucene issue: "fix SortedDocValues to no longer extend BinaryDocValues".
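The effect of that change can be illustrated with a toy model. This is not Lucene's code: `DocValuesCheckSketch`, `DvType`, and the two `readAsBinary...` methods are hypothetical names. It assumes, as the exception message suggests, that the older index stored the field with SORTED doc values, and that a binary read used to tolerate SORTED fields but no longer does once SortedDocValues stops extending BinaryDocValues.

```java
import java.util.Arrays;

final class DocValuesCheckSketch {
  /** Toy stand-in for the doc values types involved in this issue. */
  enum DvType { BINARY, SORTED }

  /** Throws if the type found in the index is not one of the accepted types. */
  static void checkField(String field, DvType actual, DvType... expected) {
    if (Arrays.asList(expected).contains(actual)) {
      return;
    }
    String want = expected.length == 1
        ? "(expected=" + expected[0]
        : "(expected one of " + Arrays.toString(expected);
    throw new IllegalStateException(
        "unexpected docvalues type " + actual + " for field '" + field + "' "
            + want + "). Re-index with correct docvalues type.");
  }

  /** Assumed older behavior: a SORTED field could also be read as binary. */
  static void readAsBinaryLucene8Style(String field, DvType indexed) {
    checkField(field, indexed, DvType.BINARY, DvType.SORTED);
  }

  /** Assumed newer behavior: only a BINARY field can be read as binary. */
  static void readAsBinaryLucene9Style(String field, DvType indexed) {
    checkField(field, indexed, DvType.BINARY);
  }
}
```

In this model, `readAsBinaryLucene9Style("id", DvType.SORTED)` throws, while the same field is accepted by the Lucene 8 style check or after it has been rewritten as BINARY.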

Reindexing with Lucene 9 fixes this issue.

A related, interesting tidbit from the SortField.Type.STRING_VAL javadoc:

"Sort using term values as Strings, but comparing by value (using String.compareTo) for all comparisons. This is typically slower than STRING, which uses ordinals to do the sorting."
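The difference between the two strategies can be sketched in plain Java. This is only a toy model with hypothetical names (`OrdinalSortSketch`, `sortByValue`, `sortByOrdinal`); in Lucene, ordinals come from SORTED doc values written at index time, whereas here they are computed on the fly purely to show that both approaches yield the same order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class OrdinalSortSketch {
  /** Orders document indexes by comparing the string values directly. */
  static List<Integer> sortByValue(String[] values) {
    List<Integer> docs = new ArrayList<>();
    for (int i = 0; i < values.length; i++) {
      docs.add(i);
    }
    docs.sort((a, b) -> values[a].compareTo(values[b]));
    return docs;
  }

  /** Orders document indexes by comparing precomputed integer ordinals. */
  static List<Integer> sortByOrdinal(String[] values) {
    // Sorted dictionary of unique terms; a term's position is its ordinal.
    TreeSet<String> unique = new TreeSet<>(Arrays.asList(values));
    String[] dict = unique.toArray(new String[0]);
    int[] ords = new int[values.length];
    for (int i = 0; i < values.length; i++) {
      ords[i] = Arrays.binarySearch(dict, values[i]);
    }
    List<Integer> docs = new ArrayList<>();
    for (int i = 0; i < values.length; i++) {
      docs.add(i);
    }
    // Each comparison is now an int comparison instead of a string comparison.
    docs.sort((a, b) -> Integer.compare(ords[a], ords[b]));
    return docs;
  }
}
```

Both methods return the same ordering; the ordinal variant just replaces every string comparison during the sort with an integer comparison, which is where the speed difference mentioned in the javadoc comes from.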

Closed by #1953.