drewdeponte/alt

Explore alternate matching & ranking algorithms

Closed this issue · 19 comments

After some relatively light research it seems that there are three primary string pattern matching algorithms for the scenario of variable pattern & variable text. They are listed below with their various ~ complexities based on my research.

  • Naive - matching ϴ(nm)
  • Rabin-Karp - (randomized linear)
  • Knuth-Morris-Pratt (KMP) - precomputation ϴ(m) matching ϴ(n)
  • Boyer Moore's & Horspool - matching ϴ(mn)

Based on my initial research, ignoring the fact that we are dealing with paths not just normal strings, it seems that KMP is probably the best option for efficiently finding an exact pattern match in some text.

Beyond the pattern matching algorithms there are a number of ranking or scoring algorithms. This article outlines a few of them and proposes another.

http://www.catalysoft.com/articles/StrikeAMatch.html

The following outlines a number of the ranking/scoring algorithms for similarity as well, such as Edit Distance, etc.
https://www.seochat.com/c/a/choosing-keywords-help/matching-strings-and-algorithms/

Wikipedia has a reference of a number of String Metrics algorithms

https://en.wikipedia.org/wiki/String_metric

So, looks like the primary question that we need to ask in this is do we need a exact pattern match algorithm or a similarity/distance scoring algorithm or maybe some combination of the two?

Once we know the answer to that we could do further comparison.

@russCloak please look this over and give any additional thoughts.

Here is a paper surveying a bunch of text similarity approaches.

http://research.ijcaonline.org/volume68/number13/pxc3887118.pdf

This is another awesome blog post that covers and analyzes a number of these.
http://ntz-develop.blogspot.com/2011/03/fuzzy-string-search.html

@exAspArk any thoughts?

It might be interesting to also look at selecta and its scoring algorithm and the decisions around it even though it is a slightly different use case.

garybernhardt/selecta#30

Another possible path would be building on top of an existing tool like say ag. We could theoretically use ag -g PATTERN to get back a filtered set of paths or something.

It also seems like we could take a scatter & gather approach with the scoring to increase performance as they would be independent isolated tasks.

What do you think of creating a Huffman Coding Tree or a Suffix Tree. Since they are trees they will both have O(log(n))
The only downside is caching the tree data structure.

Creating the tree maybe actually expensive (I'm gonna actually benchmark the time difference and let you know) but assuming that we'd like to go with one of the algorithms mentioned above, I think Knuth–Morris–Pratt algorithm best fits this use case, because in most cases we are looking for one perfect match

@samdvr both the Huffman Coding Tree and Suffix Tree are primarily useful in scenarios where you have a constant pattern and variable text, or variable pattern and constant text.

Our scenario is variable pattern, variable text.

That isn't to say that they couldn't be used in some hybrid approach. I just don't think we should start with a hybrid approach.

Apparently sinatra is going to be using this so it may be something worth looking at as well.

https://github.com/rkh/mustermann/blob/master/mustermann/README.md

I am closing this as it seems currently we have a performant enough solution. If at some point we decide we need to do further research and exploration we can reopen this discussion.