flaxsearch/luwak

Batch processing doesn't always return offsets for HighlightsMatcher

romseygeek opened this issue · 4 comments

Found this while working on tests for #142. There's no easy way to ask for PostingsEnum.OFFSETS when using the SpanCollector helpers, so we can sometimes get a PostingsEnum with no offsets loaded.

I think the best way to deal with this will be to have a specialised Codec that always loads offsets, but I need to have a play and investigate this properly.

Hi Alan. Can you please clarify what this is about? Do you think your discovery poses a problem, in theory, for the Lucene UnifiedHighlighter (in PhraseHelper.java), or is it something unique to Luwak?

The UnifiedHighlighter uses MemoryIndex, I think? In which case you'll be fine, because the MemoryIndex postings enum always makes offsets available (if they're enabled on the MI as a whole).

This comes up when luwak uses DocumentBatches, which internally use a RAMDirectory. If you don't specify PostingsEnum.OFFSETS when calling getSpans(), then offsets don't necessarily get loaded (it depends on the Codec, I think; this is where I need to dig).
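To illustrate the mechanism described above, here is a minimal, self-contained sketch. It is a toy model only: `OffsetFlagsSketch`, `ToyPostings`, and the flag values are hypothetical stand-ins, not Lucene or luwak classes. The `-1` return value is chosen to mirror the sentinel that Lucene's `PostingsEnum.startOffset()` documents for unavailable offsets.

```java
// Toy model only: hypothetical stand-ins, not Lucene or luwak classes.
final class OffsetFlagsSketch {
    // Illustrative feature bits; Lucene's real constants live on PostingsEnum.
    static final int POSITIONS = 1;
    static final int OFFSETS = 2;

    /** Stand-in for a postings reader that only loads offsets on request. */
    static final class ToyPostings {
        private final int[] starts;
        private final int[] ends;
        private final boolean offsetsLoaded;

        ToyPostings(int[] starts, int[] ends, int flags) {
            this.starts = starts;
            this.ends = ends;
            // Offsets are decoded only if the caller asked for them.
            this.offsetsLoaded = (flags & OFFSETS) != 0;
        }

        /** Start offset of occurrence i, or -1 when offsets were not loaded. */
        int startOffset(int i) {
            return offsetsLoaded ? starts[i] : -1;
        }

        /** End offset of occurrence i, or -1 when offsets were not loaded. */
        int endOffset(int i) {
            return offsetsLoaded ? ends[i] : -1;
        }
    }

    public static void main(String[] args) {
        int[] starts = {0, 10};
        int[] ends = {5, 15};
        ToyPostings positionsOnly = new ToyPostings(starts, ends, POSITIONS);
        ToyPostings withOffsets = new ToyPostings(starts, ends, POSITIONS | OFFSETS);
        System.out.println(positionsOnly.startOffset(1)); // offsets were never loaded
        System.out.println(withOffsets.startOffset(1));   // offsets are available
    }
}
```

In this model a highlighter that reads offsets from `positionsOnly` gets nothing usable, even though the offsets exist in the underlying data. That is the shape of the failure seen with DocumentBatches.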

The UH uses a MemoryIndex only if the field to be highlighted doesn't have offsets in postings or term vectors.

I believe there will be no issue because UH PhraseHelper only uses Spans as a position filter, not for their offsets. Instead, offsets are taken directly from a PostingsEnum (see FieldOffsetStrategy line 95). However, this has other highlighter accuracy shortcomings, because Span position ranges are not as accurate as the particular locations of the terms comprising the spans (long story). So I've wanted to migrate to Luwak's approach.

Turns out this was a bug in the SpanRewriter. I already have a SpanOffsetReportingQuery, which wraps a Span*Query and ensures that offsets are loaded, but the rewriter wasn't wrapping PhraseQuery properly, so phrases weren't highlighted during batch processing.
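As a rough illustration of this kind of gap, the sketch below models a rewriter that walks a query tree and wraps every leaf in an offset-reporting marker. All names here (`RewriterSketch`, `Term`, `Phrase`, `Bool`, `OffsetReporting`) are hypothetical; this is not luwak's SpanRewriter or SpanOffsetReportingQuery, and the real fix may look quite different.

```java
// Toy model only: hypothetical types, not luwak's SpanRewriter or Lucene's queries.
final class RewriterSketch {
    interface Query {}

    static final class Term implements Query {
        final String text;
        Term(String text) { this.text = text; }
    }

    static final class Phrase implements Query {
        final String[] terms;
        Phrase(String... terms) { this.terms = terms; }
    }

    static final class Bool implements Query {
        final Query[] clauses;
        Bool(Query... clauses) { this.clauses = clauses; }
    }

    /** Marks a leaf whose offsets will be requested when it is executed. */
    static final class OffsetReporting implements Query {
        final Query inner;
        OffsetReporting(Query inner) { this.inner = inner; }
    }

    /** Rewrites a tree so that every leaf, phrases included, is wrapped. */
    static Query rewrite(Query q) {
        if (q instanceof Bool) {
            Query[] in = ((Bool) q).clauses;
            Query[] out = new Query[in.length];
            for (int i = 0; i < in.length; i++) {
                out[i] = rewrite(in[i]);
            }
            return new Bool(out);
        }
        if (q instanceof OffsetReporting) {
            return q; // already wrapped
        }
        // Term and Phrase are both leaves; skipping Phrase here would
        // reproduce the kind of gap described above.
        return new OffsetReporting(q);
    }

    /** True if no bare Term or Phrase leaf remains in the tree. */
    static boolean allLeavesWrapped(Query q) {
        if (q instanceof Bool) {
            for (Query c : ((Bool) q).clauses) {
                if (!allLeavesWrapped(c)) {
                    return false;
                }
            }
            return true;
        }
        return q instanceof OffsetReporting;
    }
}
```

A check like `allLeavesWrapped(rewrite(query))` is the sort of assertion a test could make to catch an unwrapped phrase before it reaches batch matching.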