lnx-search/lnx

Question: Why choose Symspell over Tantivy's built-in fuzzy search?

Opened this issue · 3 comments

Just a question, not a bug. Why did you choose to implement Symspell instead of just using Tantivy's built-in Levenshtein automata for doing fuzzy queries?

I'm just getting into this, but it strikes me that Symspell is going to generate much larger dictionaries, where the Levenshtein code should just traverse the existing FST. Did you run into any performance issues with the existing fuzzy search? Or functionality issues?

So we actually support both, it's just generally I recommend trying fast-fuzzy first, it brings something new to the table and has some nice quality of life features.

The good bits

The reason why I generally push for Fast-Fuzzy/SymSpell is that it gives you considerably more performance (and more stable performance) by drastically cutting down the number of terms that actually need to be considered in the dataset, then it becomes a standard FTS which is very fast.

On the other side, it also lets us do some things that we couldn't reasonably do without it, like word segmentation and compound-aware corrections. For example:

We can correct helloworld into hello world as two separate terms, in cases where the user has probably just forgotten to hit space, the segmented words are then used rather than traditional methods which can't efficiently segment the words. What we're essentially giving the user is similar spell correction to how your phone tends to do it a lot of the time.

What you loose

Probably the biggest issue right now with the system is that the relevancy behaves differently from how it would with traditional fuzzy searching. I.e as of right now we don't have prefix searching however, there is an experimental branch that has it which utilises the lookup table to create possible prefixes.

Another thing we lose is generally, Is the ability currently to do score based on the distance (Right now your score comes from the BM25 score of the text search with the correct words), we could, but we would have to go with a term-by-term system which would, for the most part, lose the ability to do our compound-aware corrections. This for me is something that I don't want to lose as it's a very useful addition, we could modify the algorithm slightly to give us a rough guide on what the distances are per term, but for now, that's just what it's like.

Some other points

but it strikes me that Symspell is going to generate much larger dictionaries

This is true we generate a large lookup table essentially, but this does traditionally scale logarithmically based on your dataset (We don't use the algorithm exactly how it was intended, we build the lookup dictionary based on the data in your index rather than a set of static dictionaries). For example, a dataset of 20k documents uses maybe 50MB ish of memory, probably less. A dataset of 5 million documents with about 2.4 million unique words/tokens uses about 1.8-2GB of memory. A bigger version of the dataset with 27 million documents has about 2.6 million words and uses about 2GB of memory.

Providing your data isn't massively random, the size of your computed dictionary will level off as the dataset gets bigger.
If you're worried about the memory, for reference TypeSense recommends between 33.8GB - 50.6GB RAM for it to operate on that dataset. And in a demo with 28 million books, it apparently uses about 14GB of memory (not including virtual memory/file cache IIRC) compared to lnx which can operate on the dataset with about 1/3 of that memory usage (or less if your OS decides not to pull in data into virtual memory) so I would argue that in 100% of cases, the memory usage is negligible compared to what it provides especially around the performance side which measures like so:

Normal Fuzzy search

Note: The vertical axis range is not the same as fast-fuzzy.

image

Fast Fuzzy

Note: The vertical axis range is not the same as fuzzy.

image

Side by side

image

Thanks, this is a terrific answer.

Why is it that you gain the ability to do word segmentation search with symspell? I'm thinking that the method is to try all of the split points to see what works: hellothere -> [h, ellothere], [he, llothere], etc. Then you look up each fragment in the dictionary. That's a lot of lookups. How does fast-fuzzy enable that?

Under the hood, you're storing the dictionary (and the symspell-added terms) in the default Tantivy fst structure, right?

Why is it that you gain the ability to do word segmentation search with symspell? I'm thinking that the method is to try all of the split points to see what works: hellothere -> [h, ellothere], [he, llothere], etc. Then you look up each fragment in the dictionary. That's a lot of lookups. How does fast-fuzzy enable that?

That's probably best explained by Golf Garbe's Blog Posts but yes it can be potentially hundreds of lookups, but a hundred lookups is far less than the thousands or millions of lookups you'd need to do on your inverted index for each word with your DFA.

Under the hood, you're storing the dictionary (and the symspell-added terms) in the default Tantivy fst structure, right?

We actually don't, the lookup table is kept purely in memory with just an excessive amount of de-duplication i.e there is only ever 1 unique delete or word and then 4 Byte pointers to those words. This reduces memory usage by a reasonable amount.

We build the dictionaries after a commit based on the segment reader's term dictionaries to be exact. So the symspell structure isn't a persisted structure in itself. it just isn't worth it.