Decide on how to implement automatic renaming

Question

Decide on how to implement automatic renaming

Opened this issue 10 years ago · 1 comments

I need to decide on how I want to implement automatic payee renaming.

I believe that the best way will be to use a something that implements a Levenshtein distance or a Damerau–Levenshtein distance.

The difference between these two is pretty minimal. a Damerau-Levenshtein distance includes letter transposition while a standard Levenshtein distance does not.

levenshtein("hello", "ehllo") = 2
damerau_levenshtein("hello", "ehllo") = 1

I'm looking at 3 different packages for this:

FuzzyWuzzy (FW)
- Last update Feb 2015
- Fast, but additional overhead for short strings
- Decent API
- Poor documentation
- Probably start off with this one.
- Slightly different implementation: returns a ratio rather than the distance.
python-Levenshtein (L)
- Fast (written in C)
- Last update Feb 2014
- OK API
- GPL
pyxDamerauLevenshtein (DL)
- BSD 3-Clause Licence
- horrible API
- Allows transposition
- slow on long strings

Check out the graph for speed. FuzzyWuzzy has additional overhead for small lengths, but otherwise matches python-Levenshtein (because it uses python-Levenshtein...).

I probably won't have many items that are longer than... 35 characters. So really, I could use DL - even with 1000 entries in the payee table, it would still take less than 100ms to match against them.

Answer 1 · 2015-06-08T00:48:21.000Z

Also take a look at jellyfish. It implements a bunch of different string-similarity algorithms as well as phonetic similarities (not that I'd need that) so I can pick-and-choose which I like best.

Also see http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
and http://cran.r-project.org/web/packages/stringdist/stringdist.pdf