Explore alternate matching & ranking algorithms

Question

Explore alternate matching & ranking algorithms

Closed this issue 8 years ago · 19 comments

After some relatively light research it seems that there are three primary string pattern matching algorithms for the scenario of variable pattern & variable text. They are listed below with their various ~ complexities based on my research.

Naive - matching ϴ(nm)
Rabin-Karp - (randomized linear)
Knuth-Morris-Pratt (KMP) - precomputation ϴ(m) matching ϴ(n)
Boyer Moore's & Horspool - matching ϴ(mn)

Based on my initial research, ignoring the fact that we are dealing with paths not just normal strings, it seems that KMP is probably the best option for efficiently finding an exact pattern match in some text.

Answer 1 · 2016-04-26T07:41:23.000Z

Beyond the pattern matching algorithms there are a number of ranking or scoring algorithms. This article outlines a few of them and proposes another.

http://www.catalysoft.com/articles/StrikeAMatch.html

Answer 2 · 2016-04-26T07:48:51.000Z

The following outlines a number of the ranking/scoring algorithms for similarity as well, such as Edit Distance, etc.
https://www.seochat.com/c/a/choosing-keywords-help/matching-strings-and-algorithms/

Answer 3 · 2016-04-26T07:52:28.000Z

Wikipedia has a reference of a number of String Metrics algorithms

https://en.wikipedia.org/wiki/String_metric

Answer 4 · 2016-04-26T07:55:25.000Z

Here is a paper on the topic. http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf

Answer 5 · 2016-04-26T07:57:20.000Z

Another article on the topic.
http://www.codeproject.com/Articles/11157/An-improvement-on-capturing-similarity-between-str

Answer 6 · 2016-04-26T08:00:33.000Z

So, looks like the primary question that we need to ask in this is do we need a exact pattern match algorithm or a similarity/distance scoring algorithm or maybe some combination of the two?

Once we know the answer to that we could do further comparison.

Answer 7 · 2016-04-26T08:01:09.000Z

@russCloak please look this over and give any additional thoughts.

Answer 8 · 2016-04-26T08:02:42.000Z

Here is a paper surveying a bunch of text similarity approaches.

http://research.ijcaonline.org/volume68/number13/pxc3887118.pdf

Answer 9 · 2016-04-26T08:10:51.000Z

This is another awesome blog post that covers and analyzes a number of these.
http://ntz-develop.blogspot.com/2011/03/fuzzy-string-search.html

Answer 10 · 2016-04-26T08:14:01.000Z

@exAspArk any thoughts?

Answer 11 · 2016-04-26T17:18:42.000Z

This is a great article.

https://www.seochat.com/c/a/choosing-keywords-help/matching-strings-and-algorithms/

Answer 12 · 2016-04-26T17:27:42.000Z

It might be interesting to also look at selecta and its scoring algorithm and the decisions around it even though it is a slightly different use case.

garybernhardt/selecta#30

Answer 13 · 2016-04-26T17:29:08.000Z

Another possible path would be building on top of an existing tool like say ag. We could theoretically use ag -g PATTERN to get back a filtered set of paths or something.

Answer 14 · 2016-04-26T17:31:02.000Z

It also seems like we could take a scatter & gather approach with the scoring to increase performance as they would be independent isolated tasks.

Answer 15 · 2016-04-27T03:02:48.000Z

What do you think of creating a Huffman Coding Tree or a Suffix Tree. Since they are trees they will both have O(log(n))
The only downside is caching the tree data structure.

Creating the tree maybe actually expensive (I'm gonna actually benchmark the time difference and let you know) but assuming that we'd like to go with one of the algorithms mentioned above, I think Knuth–Morris–Pratt algorithm best fits this use case, because in most cases we are looking for one perfect match

Answer 16 · 2016-04-27T06:41:55.000Z

@samdvr both the Huffman Coding Tree and Suffix Tree are primarily useful in scenarios where you have a constant pattern and variable text, or variable pattern and constant text.

Our scenario is variable pattern, variable text.

Answer 17 · 2016-04-27T06:42:21.000Z

That isn't to say that they couldn't be used in some hybrid approach. I just don't think we should start with a hybrid approach.

Answer 18 · 2016-04-27T17:15:04.000Z

Apparently sinatra is going to be using this so it may be something worth looking at as well.

https://github.com/rkh/mustermann/blob/master/mustermann/README.md

Answer 19 · 2016-07-21T07:17:17.000Z

I am closing this as it seems currently we have a performant enough solution. If at some point we decide we need to do further research and exploration we can reopen this discussion.