markvanderloo/stringdist

Different distance using amatch and stringdist


Hi Mark, we found a behaviour that seems a bit strange. See the code below:

library(stringdist)
df <-c("IÑIGO")
df[amatch("INIGO", df, method="lv", maxDist=1)]
df[amatch("INIGO", df, method="lv", maxDist=2)]
stringdist("INIGO", "IÑIGO")

library(stringdist)
df <-c("IÑIGO")
df[amatch("INIGO", df, method="lv", maxDist=1)]
[1] NA
df[amatch("INIGO", df, method="lv", maxDist=2)]
[1] "IÑIGO"
stringdist("INIGO", "IÑIGO")
[1] 1

As you can see, the distance between INIGO and IÑIGO is 1. However, the first amatch call, with maxDist=1, returns NA, while the second, with maxDist=2, finds a match and returns position 1. We thought it was an encoding problem, but we've read in the documentation that strings are converted to UTF-32.

Are we missing something, or is this a bug?

Thank you very much.

Hi, thanks for pointing this out!

This may be a < versus <= issue somewhere.

Hi Mark,
would it be possible to know when you will release a new version of the package with this issue fixed? Or is there a workaround we can use in the meantime?

Thanks!

Not sure when I'll get time to look into it, but as a workaround, set maxDist=1.01.

Hi Mark, we have more feedback: the problem may not be limited to a < versus <= comparison.
We followed your suggestion and tested several maxDist values.

library(stringdist)
df <-c("IÑIGO")
df[amatch("INIGO", df, method="lv", maxDist=1)]
[1] NA
df[amatch("INIGO", df, method="lv", maxDist=1.01)]
[1] NA
df[amatch("INIGO", df, method="lv", maxDist=1.5)]
[1] NA
df[amatch("INIGO", df, method="lv", maxDist=1.99)]
[1] NA
df[amatch("INIGO", df, method="lv", maxDist=2)]
[1] "IÑIGO"
stringdist("INIGO", "IÑIGO")
[1] 1

thanks

Solved now. It was due to a coercion error that caused the 'useBytes' argument to be misinterpreted. I've added a regression test and will release a new version soon.
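
For readers hitting similar symptoms, here is a minimal sketch of why a misread 'useBytes' flag would give exactly the results reported above. It assumes the strings are stored as UTF-8 (forced here with a \u escape) and uses only the documented useBytes argument of stringdist; the variable names are illustrative and not part of the package.

```r
library(stringdist)

a <- "INIGO"
b <- "I\u00d1IGO"  # "IÑIGO"; the escape makes R store the string as UTF-8

# Character view: five code points each, differing in one position.
utf8ToInt(a)
utf8ToInt(b)

# Byte view: "Ñ" takes two bytes in UTF-8 (c3 91), so b is six bytes long.
charToRaw(a)
charToRaw(b)

# Levenshtein distance over characters versus over raw bytes.
d_chars <- stringdist(a, b, method = "lv", useBytes = FALSE)
d_bytes <- stringdist(a, b, method = "lv", useBytes = TRUE)

d_chars  # 1: substitute N with Ñ
d_bytes  # 2: substitute one byte and insert another
```

If amatch was effectively comparing bytes, the distance it saw was 2, which would explain why every maxDist below 2 returned NA while stringdist itself reported 1.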