Decide on how to implement automatic renaming
Opened this issue · 1 comments
I need to decide on how I want to implement automatic payee renaming.
I believe that the best way will be to use a something that implements a Levenshtein distance or a Damerau–Levenshtein distance.
The difference between these two is pretty minimal. a Damerau-Levenshtein distance includes letter transposition while a standard Levenshtein distance does not.
levenshtein("hello", "ehllo") = 2
damerau_levenshtein("hello", "ehllo") = 1
I'm looking at 3 different packages for this:
- FuzzyWuzzy (FW)
- Last update Feb 2015
- Fast, but additional overhead for short strings
- Decent API
- Poor documentation
- Probably start off with this one.
- Slightly different implementation: returns a ratio rather than the distance.
- python-Levenshtein (L)
- Fast (written in C)
- Last update Feb 2014
- OK API
- GPL
- pyxDamerauLevenshtein (DL)
- BSD 3-Clause Licence
- horrible API
- Allows transposition
- slow on long strings
Check out the graph for speed. FuzzyWuzzy
has additional overhead for small lengths, but otherwise matches python-Levenshtein
(because it uses python-Levenshtein
...).
I probably won't have many items that are longer than... 35 characters. So really, I could use DL
- even with 1000 entries in the payee table, it would still take less than 100ms to match against them.
Also take a look at jellyfish. It implements a bunch of different string-similarity algorithms as well as phonetic similarities (not that I'd need that) so I can pick-and-choose which I like best.
Also see http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
and http://cran.r-project.org/web/packages/stringdist/stringdist.pdf