dougthor42/PyBank

Decide on how to implement automatic renaming

Opened this issue · 1 comments

I need to decide on how I want to implement automatic payee renaming.

I believe that the best way will be to use a something that implements a Levenshtein distance or a Damerau–Levenshtein distance.

The difference between these two is pretty minimal. a Damerau-Levenshtein distance includes letter transposition while a standard Levenshtein distance does not.

levenshtein("hello", "ehllo") = 2
damerau_levenshtein("hello", "ehllo") = 1

I'm looking at 3 different packages for this:

  • FuzzyWuzzy (FW)
    • Last update Feb 2015
    • Fast, but additional overhead for short strings
    • Decent API
    • Poor documentation
    • Probably start off with this one.
    • Slightly different implementation: returns a ratio rather than the distance.
  • python-Levenshtein (L)
    • Fast (written in C)
    • Last update Feb 2014
    • OK API
    • GPL
  • pyxDamerauLevenshtein (DL)
    • BSD 3-Clause Licence
    • horrible API
    • Allows transposition
    • slow on long strings

Check out the graph for speed. FuzzyWuzzy has additional overhead for small lengths, but otherwise matches python-Levenshtein (because it uses python-Levenshtein...).

image

I probably won't have many items that are longer than... 35 characters. So really, I could use DL - even with 1000 entries in the payee table, it would still take less than 100ms to match against them.

Also take a look at jellyfish. It implements a bunch of different string-similarity algorithms as well as phonetic similarities (not that I'd need that) so I can pick-and-choose which I like best.

Also see http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
and http://cran.r-project.org/web/packages/stringdist/stringdist.pdf