feature23/StringSimilarity.NET

Wrong application of the Winkler penalty

ehsan-f opened this issue · 2 comments

jw = j + Math.Min(JW_COEF, 1.0 / mtp[THREE]) * mtp[2] * (1 - j);

Problem 1
The Math.Min(JW_COEF, 1.0 / mtp[THREE]) is not necessary as it should just be JW_COEF - Jaro-Winkler does not specify replacing the scaling factor with 1/(length of longest string).

For example, if a string is 11 characters long, and JW_COEF = 0.1, then instead of 0.1, 1/11 is used.

Problem 2
The mtp[2] (prefix) is the Matches function is not capped at 4 - and it needs to be capped when being applied to final calculation.

Solution
I would like to suggest the below calculation as a solution:
jw = j + ((JW_COEF * Math.Min(mtp[2], 4)) * (1 - j))

I have cross referenced this with written papers and implementation in other languages such as R. (https://journal.r-project.org/archive/2014-1/loo.pdf)

Hi @ehsan-f, thanks for submitting this!

This project is actually a port of an upstream project written in Java, so we do not want to deviate from how they are doing it. Would you mind filing a bug with the upstream project? Then, if they deem this to be correct and get it fixed, we'll incorporate the changes. I believe the line in question upstream is: https://github.com/tdebatty/java-string-similarity/blob/master/src/main/java/info/debatty/java/stringsimilarity/JaroWinkler.java#L87

Thanks!

Closing this issue until such a time as a fix is implemented upstream.