sp1ff/damerau-levenshtein

Issue#2 still exists: attached is a proposed solution with some advantages

Closed this issue · 3 comments

First of all, thank you very much for sharing your implementation of the Berghel-Roach algorithm here! For me, as someone who doesn't deal with this topic professionally, the explanations in the authors' article were so cryptic that I would not have dared to write the code from scratch myself. Your example helped me a lot!

However, the explanations in the authors' article are not only cryptic but also incomplete, as issue#2 shows. I have now noticed that even with your proposed solution to that problem in place, the function f() still erroneously overwrites parts of the fixed range of fkp[].

Example of the incorrect behavior: computing dl("bcac", "cc") changes fkp[13] from -4 to -3.

I have found a new solution, which seems better to me for the following reasons:

  1. The overwriting of fixed ranges of fkp[] is completely eliminated.
  2. The code becomes more efficient, since many useless calls of f() are omitted and some case distinctions in f() are no longer needed.

My suggestions for changes to the br.cc file in detail:

  1. Set p in line 88 as follows: p = abs(k);.
  2. Delete the following lines: 36, 37, 39, 50, 51, 53-55, 57, 63, and 65.
  3. Line 38 becomes: ptrdiff_t t = fkp[(k + zero_k)*max_p+p] + 1;.
  4. Line 41 can be expressed a little more clearly as follows: if (t > 0 && t < m && k + t > 0 && k + t < n) {.
  5. Line 52 becomes: ptrdiff_t ta = fkp[(k - 1 + zero_k)*max_p+p];.
  6. Line 56 becomes: ptrdiff_t tb = fkp[(k + 1 + zero_k)*max_p+p] + 1;.
  7. Simplify line 62 as follows: while (t < L && A[t] == B[t+k]) ++t;

Verification: after these changes are made, fkp[] contains the patterns that Berghel & Roach present in Figures 3 and 5 as examples of a worst-case and an "average" scenario, i.e. dl("ABCDE", "FGHIJ") and dl("AVERY", "GARVEY").

What do you think?

PS: One difference remains between the fkp[] from the article and the one produced by your code: in the unused, variable areas the article specifies the value -inf instead of 0. For a more universal, more robust variant of the code this should be changed, along with two other small details; see issue#4.

Hello and thank you for this! It's been a while since I've worked on this topic, but I'm context-swapping these algorithms back into my head and starting to look at this.
The present implementation, as you've discovered, is more illustrative than production grade. Perhaps we can do something about that.

Update: I noticed that on p. 97 of Berghel & Roach, they explicitly assume that m ≤ n (i.e. the source string is no longer than the target). My code doesn't enforce that. Not sure if that's the entire issue... still checking, but you'll notice that throughout the algorithm they use the quantity n - m without regard to sign.

Pushed a fix.

Going to close this one out. If you think there's still an issue, would you consider cutting a PR? It's easier to look at code differences than textual descriptions.