Wrong application of the Winkler penalty
ehsan-f opened this issue · 2 comments
Problem 1
The Math.Min(JW_COEF, 1.0 / mtp[THREE])
is not necessary as it should just be JW_COEF
- Jaro-Winkler does not specify replacing the scaling factor with 1/(length of longest string).
For example, if a string is 11 characters long, and JW_COEF = 0.1
, then instead of 0.1, 1/11 is used.
Problem 2
The mtp[2]
(prefix) is the Matches
function is not capped at 4 - and it needs to be capped when being applied to final calculation.
Solution
I would like to suggest the below calculation as a solution:
jw = j + ((JW_COEF * Math.Min(mtp[2], 4)) * (1 - j))
I have cross referenced this with written papers and implementation in other languages such as R. (https://journal.r-project.org/archive/2014-1/loo.pdf)
Hi @ehsan-f, thanks for submitting this!
This project is actually a port of an upstream project written in Java, so we do not want to deviate from how they are doing it. Would you mind filing a bug with the upstream project? Then, if they deem this to be correct and get it fixed, we'll incorporate the changes. I believe the line in question upstream is: https://github.com/tdebatty/java-string-similarity/blob/master/src/main/java/info/debatty/java/stringsimilarity/JaroWinkler.java#L87
Thanks!
Closing this issue until such a time as a fix is implemented upstream.