lanl/pyxDamerauLevenshtein

Calculated distance seems to be incorrect in some situations

paulkernfeld opened this issue · 6 comments

It's definitely possible that I am doing something totally silly here, but I think that this algorithm is returning an incorrect value for a pair of inputs that I tried.

In [1]: from pyxdameraulevenshtein import damerau_levenshtein_distance

In [2]: damerau_levenshtein_distance("49482", "48924")
Out[2]: 4

I believe that the Damerau-Levenshtein distance between these two sequences is actually 3, not 4. The words can be transformed as follows:

After 0 operations: 49482
After 1 operations: 49842 (transpose 8 and 4)
After 2 operations: 49824 (transpose 2 and 4)
After 3 operations: 48924 (transpose 9 and 8)

Left a comment about this on another ticket / pull request a while ago: #3 (comment)

Is it possible that the library calculates the optimal string alignment distance instead of the true Damerau-Levenshtein distance: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance ?

First off, sorry @CaptSolo, I somehow totally missed your comment on #3. I didn't mean to ignore you! I just discovered that we can use Wolfram|Alpha to compute the DL distance as a sanity check. @paulkernfeld, W|A agrees with you.

I need to run a few tests.

I based my code on this pure Python implementation. I just tested that implementation, and it generates the same result that @paulkernfeld identified in the original post.

I didn't actually realize there were two variants on the DL algorithm until @CaptSolo pointed it out. Michael Homer and I are definitely implementing the optimal string alignment distance and not the true DL distance. Granted, they'll be equivalent in a number of cases but not always.

I think what I'll do is fix the DL methods so that they return the true DL distance. And then I'll keep the optimal string alignment distance code around and add it as a new method so that both are available.

@gfairchild : sounds good re. implementing true DL distance.

However, some users may be surprised when results returned by DL functions change between versions. Would need a good way how to let users know that there is a major change in how these methods work.

(Another option would be to add new methods for both versions and deprecate the old method names so that existing code would break and people would notice & choose the right function for them. But I am certain that this is a better solution.)

@CaptSolo, that's a totally reasonable point (about just changing the output unexpectedly). As for the proper way to handle this, I like the deprecation idea, but I'm certainly open to other thoughts as well.

I'm incredibly busy right now, though, and haven't been able to get to this like I hoped. Through the magic of open source, I'd love it if someone could submit a PR on this issue.

I just released v1.4.1, which, among other things, clarifies both in the README and in the code comments that this implementation is of the optimal string alignment distance algorithm.

I'm definitely willing to receive a PR that adds other algorithms, but I think this is sufficient for now, so I'm closing this issue.