cut.a for stringdist.method == "lv"
wbakerrobinson opened this issue · 4 comments
Hello,
I assume that "lv" is Levenshtein distance. My understanding is that, when the two strings are of equal length, the Levenshtein distance ranges from 0 to the Hamming distance. In this case I am comparing standardized dates of birth by "lv", so this range should hold. The documentation states that cut.a is the lower bound for a full string-distance match, ranging between 0 and 1. If I want to treat Levenshtein distances of 0 or 1 as matches, what should I input for the cut.a argument? If the input is a number between 0 and 1, how is the Levenshtein distance mapped to that range?
Thanks!
Will
Hi Will,
Yes, "lv" refers to Levenshtein. However, its values have been adjusted to fit within the range of 0 to 1. A value of 0 indicates that the strings are different, while a value of 1 means they are identical. By default, the threshold for agreement (cut.a) is set at 0.94, but you can modify this value if you find it too strict or too lenient.
If anything else comes up, please don't hesitate to let us know.
All my best,
Ted
Hi Ted,
Thank you for the quick response. Can you tell me how you map values from Levenshtein distance to the range [0, 1]?
In the examples below what would a Levenshtein distance of 1 map to? What would a Levenshtein distance of 2 map to?
library(stringdist)
dobA <- "1900-01-01"
dobB <- "1900-01-01"
dobC <- "1900-02-01"
dobD <- "1901-02-01"
dobE <- "3333 33 33"
# How does fastLink map these values to [0, 1]?
# Returns 0, which fastLink maps to 1
stringdist(dobA, dobB, method = "lv")
# Returns 1, which maps to ?
stringdist(dobA, dobC, method = "lv")
# Returns 2, which maps to ?
stringdist(dobA, dobD, method = "lv")
# Returns 10, which fastLink maps to 0
stringdist(dobA, dobE, method = "lv")
Thanks,
Will
Disclaimer: I am a regular fastLink user, not a fastLink developer.
Are you interested in partial string matches? If yes, see the fastLink function gammaCKpar for the mapping. If no, see the fastLink function gammaCK2par for the mapping.
Caution: I do not recommend using "lv" with the string variable dob because "lv" assumes that every character is equally important. Usually this assumption is not true with dob. In the example 1900-01-01 (yyyy-mm-dd), a 1-character change to 1920-01-01 (a 20-year change) is usually more important than the change to 1900-01-21 (a 20-day change). For partial matching, I instead recommend using the numeric variable age. For exact matching, dob works well.
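To make the caution concrete, here is a small sketch comparing the two changes mentioned above. It uses base R's adist() as a stand-in for stringdist(..., method = "lv"), which is an assumption of this example rather than anything fastLink does; the variable names are illustrative only.

```r
# Reference date of birth and the two altered versions discussed above.
ref <- "1900-01-01"
alt <- c("1920-01-01", "1900-01-21")

# Levenshtein distance via base R (stand-in for stringdist's "lv" method).
edits <- as.vector(utils::adist(ref, alt))

# Calendar distance between the same dates, in days.
days_apart <- abs(as.numeric(as.Date(alt) - as.Date(ref)))

# Both changes cost exactly one edit, although one shifts the date by
# about 20 years and the other by 20 days.
print(data.frame(alt, edits, days_apart))
```

Both rows show the same edit distance, so an "lv" threshold cannot tell the two cases apart, whereas a numeric comparison can.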
I looked into the function gammaCKpar, and found the following:
lv = 1 - (stringdist(stringA, stringB, method = "lv") * 1/max(nchar(stringA), nchar(stringB)))
For the distance-1 example above (dobA vs. dobC, 10 characters each):
lv = 1 - (1 x 1/10)
lv = 0.9
For the distance-2 example above (dobA vs. dobD):
lv = 1 - (2 x 1/10)
lv = 0.8
I appreciate your unsolicited feedback on the use of matching variables, but I am merely trying to replicate a linkage done by my coworker in another linkage software. The other linkage software has a string comparison that allows for "typos", and this seems to be most closely replicated in fastLink by the use of "lv". Now that I understand how "lv" distances are mapped to [0, 1], I can set a threshold that allows for a certain number of differences. Some may also find this helpful for a fixed-length field like zip code.
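A minimal sketch of that mapping and of choosing a threshold from it, assuming the normalization described above (distance divided by the character count of the longer string). It uses base R's adist() as a stand-in for stringdist(..., method = "lv"). The helpers lv_similarity and cut_for_edits are hypothetical names, not fastLink functions. Whether fastLink counts a similarity exactly equal to cut.a as a match is not confirmed here, so the cutoff is placed midway between two achievable values.

```r
# Hypothetical helper: Levenshtein similarity on [0, 1], assuming the
# normalization 1 - distance / (character count of the longer string).
lv_similarity <- function(a, b) {
  # utils::adist() computes the Levenshtein distance in base R.
  d <- mapply(function(x, y) as.vector(utils::adist(x, y)), a, b,
              USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

# Hypothetical helper: a cutoff that accepts up to k edits in strings of
# n characters. It sits halfway between the similarity for k edits and
# the similarity for k + 1 edits, avoiding boundary comparisons.
cut_for_edits <- function(k, n) 1 - (k + 0.5) / n

dobs <- c("1900-01-01", "1900-02-01", "1901-02-01", "3333 33 33")
sims <- lv_similarity("1900-01-01", dobs)
print(sims)

dob_cut <- cut_for_edits(k = 1, n = 10)  # 0.85 for 10-character dates
zip_cut <- cut_for_edits(k = 1, n = 5)   # 0.7 for 5-character zip codes

# Which comparisons would count as agreement at the dob cutoff?
print(sims >= dob_cut)
```

This only works cleanly for fixed-length fields. When string lengths vary, the same number of edits maps to different similarities, so a single cutoff no longer corresponds to a fixed number of typos.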