pisa-engine/pisa

Effort required to support phrase matches (for a full comparison in Tantivy's benchmark game)

jjsaas opened this issue · 8 comments

How much effort would it require to provide enough support for phrase matching for a full comparison in Tantivy's benchmark game? Would treating a phrase as a term be sufficient?

The benchmark is at:
https://tantivy-search.github.io/bench/

Hi @jjsaas

This is a good question. Unfortunately, I think there's no shortcut here.

Would treating a phrase as a term be sufficient?

It's not that simple. In theory, you could (at index time) give every adjacent pair of terms its own posting list, which would then behave like any other term at query time. However, there are some issues with this approach. First, you can only support up to some n consecutive terms, because it's virtually impossible (or at least prohibitive) to index tuples of all possible lengths. Furthermore, each length you support adds roughly the size of the original index: indexing all adjacent pairs would take up about as much space as indexing single terms, indexing triples would take about as much again, and so on.
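To make the expansion idea concrete, here is a minimal Python sketch (PISA itself is C++; this is a toy in-memory model, and the document names and tokens are made up for illustration). It builds posting lists for single terms and adjacent pairs, so a two-word phrase becomes an ordinary lookup -- and it shows why each extra n-gram length roughly duplicates the index size:

```python
from collections import defaultdict

def build_postings(docs, max_n=2):
    """Build posting lists for single terms and adjacent n-grams up to max_n.

    `docs` maps a doc ID to its token sequence. Note that a document with
    T tokens contributes T unigrams but also T-1 bigrams, so the bigram
    index alone is roughly as large as the unigram one -- which is why
    this only scales to small n.
    """
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                postings[tuple(tokens[i:i + n])].add(doc_id)
    return postings

docs = {
    0: ["griffith", "observatory", "tour"],
    1: ["the", "observatory", "at", "griffith", "park"],
}
postings = build_postings(docs)
# The phrase "griffith observatory" is now a single posting list:
print(postings[("griffith", "observatory")])  # {0}
```

Anything longer than `max_n` consecutive terms still cannot be answered this way, which is the limitation described above.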

Thus, I think this type of approach only makes sense as an optimization. For example, if you can determine which tuples would be queried often, then it may make sense to build those. But then you can also argue that dynamically caching queries could be more straightforward.


I believe that to really support phrases, we would need a positional index, which is not supported currently, and we have never worked on it as far as I know.
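For intuition about what a positional index buys you, here is a hedged Python sketch (again a toy in-memory structure, not PISA's actual index format, which is an assumption on my part). Each term maps to, per document, the sorted list of positions where it occurs; a phrase matches when some occurrence of the first term is followed by the remaining terms at consecutive positions:

```python
def phrase_match(positional_index, phrase):
    """Find documents containing `phrase` as consecutive terms.

    `positional_index` maps term -> {doc_id: sorted list of positions}.
    A document matches when some position p of the first term has the
    i-th remaining term at position p + i + 1 for every i.
    """
    first, *rest = phrase
    results = set()
    for doc_id, positions in positional_index.get(first, {}).items():
        for p in positions:
            if all(p + i + 1 in positional_index.get(t, {}).get(doc_id, ())
                   for i, t in enumerate(rest)):
                results.add(doc_id)
                break  # one occurrence is enough for this document
    return results

index = {
    "griffith":    {0: [0], 1: [3]},
    "observatory": {0: [1], 1: [1]},
}
print(phrase_match(index, ["griffith", "observatory"]))  # {0}
```

Document 1 contains both terms but not adjacently, so only document 0 matches -- exactly the distinction a non-positional index cannot make.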

Note that most search engines that support phrases also give you a slop parameter, which is how many terms you may allow to be missing, which makes it a little more complicated still.

@JMMackenzie @amallia does that make sense? Let me know if you have anything to add.


To sum up, it would be a significant effort to implement it, though possible.

Appreciate the clarification. Hopefully there will be some sort of phrase matching support by the time the project hits 1.0

If there's interest, we can design a positional inverted index. Even easier, though, is to integrate a forward index and handle phrase matching as a two-stage retrieval process. I'm not sure either fits that well into PISA; positional indexes are probably the way to go, but that would be a lot of work.

I think the two-stage approach is neither interesting nor that useful, because it doesn't really do the same thing, does it? To clarify, you mean: get results, then filter them out based on the forward index, right?

I feel like this would likely filter out a lot, so you'd need a larger k, but then how do you estimate that, and it's never guaranteed, etc. I think it's overkill; we might as well implement a positional index and do it at retrieval time.

I personally think it's something I'd like to see in PISA, but of course there's always the problem of resources. The way I see it, there are more pressing issues: lots of remaining refactoring, and some important structural changes to make PISA easier to use and extend.

But if we got some volunteer to help out, that would be nice ;)

I think to support exact phrase matching you can just do a conjunction on the terms and then check those documents via the forward index -- I don't think that's wrong in any way.
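The conjunction-then-verify idea can be sketched in a few lines of Python (toy structures again; in PISA the conjunction would run over compressed posting-list cursors, and the forward index stores term ID sequences -- the IDs below are invented for illustration):

```python
def conjunction(postings, terms):
    """Boolean AND over posting lists (`postings` maps term ID -> doc IDs)."""
    result = set(postings[terms[0]])
    for t in terms[1:]:
        result &= set(postings[t])
    return result

def verify_phrase(forward_index, doc_id, phrase):
    """Check via the forward index (doc ID -> term ID sequence) whether
    the phrase occurs as a consecutive run."""
    doc = forward_index[doc_id]
    n = len(phrase)
    return any(doc[i:i + n] == phrase for i in range(len(doc) - n + 1))

forward_index = {
    0: [10, 20, 30],       # terms adjacent: phrase matches
    1: [40, 20, 50, 10],   # contains both terms, but not adjacent
}
postings = {10: [0, 1], 20: [0, 1]}
phrase = [10, 20]
matches = [d for d in conjunction(postings, phrase)
           if verify_phrase(forward_index, d, phrase)]
print(matches)  # [0]
```

Because the first stage is a Boolean conjunction rather than a top-k query, there is no k to estimate: every document containing all the terms is checked exactly once.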

For the other query type (required terms, like +griffith +observatory) we could rewrite the cursor interface to yield a threshold that is the sum over the required terms -- this would then ensure that only documents containing the required terms are scored by dynamic pruning algorithms.

But yes, I agree in general that phrase search with positional indexes is more flexible, useful, and interesting.

What do you mean by "check documents via forward index"? I assumed you meant after retrieval, hence my point about filtering. On the other hand, if we do it during retrieval as we go, I assumed it would be very slow, but maybe I'm wrong? I suppose each matching (by conjunction) document would need a single pass, basically a single run of the KMP algorithm but using ints (term IDs) instead of characters 🤷 Now I'm kind of curious to see how fast/slow this would be... Certainly very slow for exhaustive processing, but BMW?
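The "KMP over term IDs" idea mentioned above is mechanical to write down; here is a hedged Python sketch (KMP is honestly overkill for typical two-or-three-word phrases, where the naive scan in the earlier comment is fine, but it illustrates the single-pass property):

```python
def kmp_find(doc_terms, phrase):
    """Return True if `phrase` occurs as a contiguous run in `doc_terms`,
    using Knuth-Morris-Pratt over integer term IDs instead of characters."""
    # Build the standard KMP failure table for the phrase.
    fail = [0] * len(phrase)
    k = 0
    for i in range(1, len(phrase)):
        while k and phrase[i] != phrase[k]:
            k = fail[k - 1]
        if phrase[i] == phrase[k]:
            k += 1
        fail[i] = k
    # Scan the document's term sequence in a single pass.
    k = 0
    for t in doc_terms:
        while k and t != phrase[k]:
            k = fail[k - 1]
        if t == phrase[k]:
            k += 1
        if k == len(phrase):
            return True
    return False

print(kmp_find([7, 3, 7, 3, 9], [7, 3, 9]))  # True
print(kmp_find([7, 3, 7, 3], [7, 3, 9]))     # False
```

Each candidate document costs one linear scan of its term sequence, which is why doing this for every document in an exhaustive conjunction would be slow, and why restricting it to the survivors of a dynamic pruning pass (the BMW question above) is the interesting case.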

Of course it would get a bit more complex with a slop parameter.

@JMMackenzie do you have any idea how phrase search is implemented in, say, Lucene? I've never actually looked into implementation details...

Yeah, this was my idea, but my point was that you wouldn't do a disjunctive top-k query for your first stage, you'd just do a Boolean conjunction. So yes, two stage, but also probably very fast. (My assumption was that phrase queries were not ranked, I am probably wrong 🙃)

Not too sure about Lucene, I'd expect a positional invidx.

Ah, ok, I get it now; yes, I was thinking top-k, but if you do a full conjunction and then filter, that would work.

The problem is that our non-ranked conjunction query doesn't score, I believe, and you wouldn't get the dynamic pruning that you might get if you combined, say, block_max_ranked_and_query with an additional filter.

Anyway, there's some stuff to think about. I've been thinking of refactoring how query/scoring/retrieval is designed to make it easier to extend and modify. If done, this may be easier to implement.