chrislit/abydos

DiscountedLevenshtein can be less than Levenshtein....?

chrislit opened this issue · 4 comments

from abydos.distance import Levenshtein, DiscountedLevenshtein
lev = Levenshtein()
dlev = DiscountedLevenshtein()
lev.dist('cat', 'hat') < dlev.dist('cat', 'hat')

Is this correct, though?

Also, this alignment seems sub-optimal. (I think the l in Neil should be matched with an l in Niall.)

cmp.alignment('Niall', 'Neil')
(2.526064024369237, 'N-iall', 'Neil--')

Fixed the alignment issue in b04ca90.

This is a result of the normalizing term in combination with the discounting function. It's worth re-examining this issue to determine if the supplied discounting functions are good, but it's not a bug.
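To illustrate how a normalizing term and a discounting function can interact this way, here is a toy sketch. It is not abydos's actual implementation: the `1 / (1 + pos)` discount, the sum-of-discounts normalizer, and all function names below are assumptions chosen only to make the effect visible. An edit at the start of the string keeps its full cost, while the normalizer shrinks because later positions are discounted, so the normalized discounted distance ends up larger than the normalized plain distance.

```python
def edit_distance(src, tar, discount):
    """Edit distance where an edit at position pos costs discount(pos)."""
    # d[i][j] is the cost of turning src[:i] into tar[:j].
    d = [[0.0] * (len(tar) + 1) for _ in range(len(src) + 1)]
    for i in range(1, len(src) + 1):
        d[i][0] = d[i - 1][0] + discount(i - 1)
    for j in range(1, len(tar) + 1):
        d[0][j] = d[0][j - 1] + discount(j - 1)
    for i in range(1, len(src) + 1):
        for j in range(1, len(tar) + 1):
            cost = discount(max(i, j) - 1)
            same = src[i - 1] == tar[j - 1]
            d[i][j] = min(
                d[i - 1][j] + cost,                        # deletion
                d[i][j - 1] + cost,                        # insertion
                d[i - 1][j - 1] + (0.0 if same else cost)  # match or substitution
            )
    return d[-1][-1]


def normalized(src, tar, discount):
    """Divide by the largest possible total cost for the longer string (assumed normalizer)."""
    longest = max(len(src), len(tar))
    if longest == 0:
        return 0.0
    return edit_distance(src, tar, discount) / sum(discount(p) for p in range(longest))


def no_discount(pos):
    return 1.0


def toy_discount(pos):
    # Hypothetical discount: later positions cost less.
    return 1.0 / (1.0 + pos)


plain = normalized('cat', 'hat', no_discount)        # 1 / 3, about 0.333
discounted = normalized('cat', 'hat', toy_discount)  # 1 / (1 + 1/2 + 1/3), about 0.545
print(plain < discounted)  # True
```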

Do you know of any code example of using abydos for matching two Python string lists by calculating minimal distances?

longRefList = ["Name 0001", "Name 0002", ... "Name 9999"]
mylist = ["Name 2345", "xdsdfj ABCD", "Name x23f"] 
# ... whatever code to calculate, 
# for each item in list 2, the distance & position of closest item in list 1 
# ... to output something like this:
matchOutput = [
    {"dist":0, "position":2344}, 
    {"dist":0.999, "position": 8831}, 
    {"dist":0.5, "position":230}
]

I am particularly interested in using ReesLevenshtein distance, but I wonder how slow this could be.
Do you know if somebody has tried using abydos to merge pandas dataframes by minimal distance matching between two columns?

Thanks a lot in advance for your advice.
@abubelinha
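
A minimal brute-force sketch of the list matching asked about above, in case it helps. To keep it self-contained it uses a plain pure-Python Levenshtein rather than abydos; the function names are made up for this example. Any callable that takes two strings and returns a number could be passed as `dist` instead, which presumably includes the `dist` method of an abydos measure, though that is an assumption and not tested here.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def normalized_levenshtein(a, b):
    """Levenshtein distance divided by the longer length, so it lies in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def closest_matches(queries, references, dist=normalized_levenshtein):
    """For each query, find the distance and 0-based position of the nearest reference."""
    output = []
    for query in queries:
        # min() keeps the first position when several references tie.
        position, best = min(
            ((pos, dist(query, ref)) for pos, ref in enumerate(references)),
            key=lambda pair: pair[1],
        )
        output.append({"dist": best, "position": position})
    return output


# Small made-up lists, just to show the shape of the result.
references = ["Name %04d" % n for n in range(1, 21)]
queries = ["Name 0005", "Name 001x"]
matches = closest_matches(queries, references)
print(matches)
```

This compares every query against every reference, so the number of distance calls is len(queries) × len(references). With a reference list of about 10,000 names that is 10,000 distance computations per query, so the speed of the chosen measure matters a lot.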