mikegoatly/lifti

Query syntax: Support wildcard field searches/searching across all dynamic fields from a specific provider

mikegoatly opened this issue · 3 comments

An extension of #76 - I've just realised that wildcard field names are going to be a bit problematic. When parsing text from a query, the QueryTokenizer needs to know which index tokenizer to use when processing the search text.

Consider this index:

var index = new FullTextIndexBuilder<int>()
    .WithDefaultTokenization(t => t.WithStemming()) // Stemming on all fields by default
    .WithObjectTokenization<Customer>(o => o
        .WithKey(c => c.Id)
        .WithField(
           "Name", 
           c => c.Name, 
           tokenizationOptions: fo => fo.WithTokenization(t => t)) // No stemming on the Name field
        .WithDynamicFields("Tags", c => c.TagDictionary, "Tag_")
    )
    .Build();

The default index tokenizer uses stemming, whereas the field Name has its own index tokenizer configured without stemming. If we allowed wildcard field names like this [Na*]=Something then it's no longer clear which tokenizer to use for the search text Something (especially if we ended up with another field starting with Na).

So I think as things stand, the options are:

  1. Support wildcards, but duplicate the search parts for each matched field, e.g. [Tag_*]=foo would be equivalent to searching for [Tag_One]=foo | [Tag_Two]=foo | [Tag_Three]=foo
  2. Support searching across all fields emitted by a named dynamic field provider using some other syntax, e.g. [?Tags]=foo (Syntax TBD). A single dynamic field provider will only ever have one index tokenizer associated with it, so this should work.

The first option would have a performance impact on the query, and we're probably going to need to build in some search optimisations to cache the search results emitted by a query to save the same search predicate being performed multiple times.

The second option is a bit more limited, but at least solves the issue across a specific dynamic field source.
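
To make the first option concrete, here is a minimal sketch of the expansion step. It is not LIFTI code: `WildcardFieldExpansion` and `Expand` are hypothetical names, only a trailing `*` is handled, and field names are assumed to match case-sensitively. The only things taken from the discussion above are the `[Field]=text` and `|` query syntax.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of LIFTI: expands a trailing-wildcard field
// pattern into one field-restricted query part per matching field name.
public static class WildcardFieldExpansion
{
    public static string Expand(string fieldPattern, string searchText, IEnumerable<string> knownFields)
    {
        if (!fieldPattern.EndsWith("*", StringComparison.Ordinal))
        {
            // No wildcard: the pattern already names a single field.
            return $"[{fieldPattern}]={searchText}";
        }

        // Everything before the trailing '*' is treated as a field name prefix.
        string prefix = fieldPattern.Substring(0, fieldPattern.Length - 1);

        // Assumption: case-sensitive prefix match against the known field names.
        List<string> parts = knownFields
            .Where(field => field.StartsWith(prefix, StringComparison.Ordinal))
            .Select(field => $"[{field}]={searchText}")
            .ToList();

        if (parts.Count == 0)
        {
            throw new ArgumentException($"No fields match the pattern '{fieldPattern}'.");
        }

        // Each matched field gets its own copy of the search part, OR'd together.
        return string.Join(" | ", parts);
    }
}
```

Calling `WildcardFieldExpansion.Expand("Tag_*", "foo", new[] { "Name", "Tag_One", "Tag_Two", "Tag_Three" })` yields `[Tag_One]=foo | [Tag_Two]=foo | [Tag_Three]=foo`. Because each expanded part names exactly one field, the existing per-field tokenizer lookup would still apply, at the cost of evaluating the same search text once per field.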

h0lg commented

I understand that in your example it is unclear which tokenizer to apply to the search text if the index itself uses a different tokenizer than the field(s) being searched. I never thought about this configuration and don't have an answer.

But how does lifti decide which tokenizer to use for the search text when searching across all fields with different configured tokenizers? Isn't that a similar question? Or am I missing some important difference?

mikegoatly commented

@h0lg If no field is specified, then currently the default index tokenizer is used to parse and normalize the search text - it's only when a specific field is being searched that LIFTI uses the index tokenizer that was configured for it.

In that respect, you're right in that searching across all fields will be a problem if different tokenization has been used for them, and that's exactly the same as the problem that needs to be solved here.

I'd need to spend a bit more time thinking about this than I have right now, but I'm wondering if when searching for text across multiple fields:

  • All affected fields are collected (all fields, or a subset when a wildcarded field name is specified)
  • Each unique tokenizer is used to parse the search text.
  • The distinct search terms yielded from the tokenizers are combined with a field filter operator with the appropriate field ids. (A search term in this context could be any number of tokens if a bracketed statement is encountered)

Edge cases to consider:

  • When searching across all fields, if all tokenizers are the same or all unique tokenizers produce the same search terms, then no field filters need to be applied.
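
The steps and the edge case above can be sketched roughly as follows. None of this is LIFTI's API: `FieldFilteredTerms`, `MultiFieldQueryPlanner`, and the modelling of a tokenizer as a plain `Func<string, IReadOnlyList<string>>` are assumptions made purely for illustration, and "unique tokenizer" is approximated by delegate equality.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only; none of these are LIFTI APIs.
public sealed record FieldFilteredTerms(IReadOnlyList<string> Terms, IReadOnlyList<string> Fields);

public static class MultiFieldQueryPlanner
{
    public static IReadOnlyList<FieldFilteredTerms> Plan(
        string searchText,
        IReadOnlyDictionary<string, Func<string, IReadOnlyList<string>>> fieldTokenizers,
        Func<string, bool> fieldSelector)
    {
        return fieldTokenizers
            // 1. Collect the affected fields (all fields, or a wildcarded subset).
            .Where(entry => fieldSelector(entry.Key))
            // 2. Parse the search text once per unique tokenizer.
            .GroupBy(entry => entry.Value)
            .Select(group => (
                Terms: group.Key.Invoke(searchText),
                Fields: group.Select(entry => entry.Key)))
            // 3. Merge tokenizers that produced identical terms, so each distinct
            //    term sequence is paired with every field it applies to.
            .GroupBy(result => string.Join("\u0001", result.Terms))
            .Select(group => new FieldFilteredTerms(
                group.First().Terms,
                group.SelectMany(result => result.Fields)
                     .OrderBy(field => field, StringComparer.Ordinal)
                     .ToList()))
            .ToList();
    }

    // Edge case: if a single term sequence covers every field in the index,
    // the query needs no field filter at all.
    public static bool NeedsFieldFilters(IReadOnlyList<FieldFilteredTerms> plan, int totalFieldCount) =>
        !(plan.Count == 1 && plan[0].Fields.Count == totalFieldCount);
}
```

As a usage example, suppose `Name` uses a tokenizer that only lower-cases, while `Tag_One` and `Tag_Two` share a toy "stemmer" that also trims a trailing `s`. Planning the text `Tags` across all three fields gives two entries (`tags` for `Name`, `tag` for the two tag fields), so field filters are needed. Planning `Tag` gives a single entry covering all three fields, which is the case where the filters can be dropped.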

I think this will require quite a bit of rework in the query parser logic, but it's certainly not impossible...

h0lg commented

I see, thanks for the clarification and sharing your thoughts.

Explaining the intricacies of the tokenization during the field search process and what happens in which case seems daunting to me. Maybe we're overcomplicating this? You could go with some rule that's easy to communicate and doesn't require you to explain the underlying mechanics - even if it has limitations. e.g.

If you search the same term/query across multiple fields (using wild cards or pipes or whatever), you can only do so if they share the same tokenizer. Otherwise you have to write separate field queries.

Would that make things easier?
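
For comparison, the simpler rule suggested above could be enforced with a check along these lines. This is only a sketch: `SharedTokenizerRule` and `Resolve` are hypothetical names, not LIFTI APIs, and tokenizers are compared with their default equality.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of LIFTI: a multi-field search is only allowed
// when every matched field shares the same tokenizer.
public static class SharedTokenizerRule
{
    public static TTokenizer Resolve<TTokenizer>(IReadOnlyDictionary<string, TTokenizer> matchedFields)
        where TTokenizer : notnull
    {
        List<TTokenizer> distinctTokenizers = matchedFields.Values.Distinct().ToList();

        if (distinctTokenizers.Count == 0)
        {
            throw new InvalidOperationException("No fields matched the query.");
        }

        if (distinctTokenizers.Count > 1)
        {
            string names = string.Join(", ", matchedFields.Keys.OrderBy(k => k, StringComparer.Ordinal));
            throw new InvalidOperationException(
                $"Fields {names} use different tokenizers; write a separate field query for each.");
        }

        // Exactly one tokenizer: safe to use it for the whole search text.
        return distinctTokenizers[0];
    }
}
```

The trade-off is that a query mixing, say, a stemmed and an unstemmed field fails with a clear message rather than being silently rewritten, which keeps the behaviour easy to document.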