Using a pre-existing EM object does not work unless all comparison levels are present
zmbc commented
Note: I am still not 100% confident in my diagnosis here. The title of this issue is my best guess of the error case.
I've seen some confusing behavior with pre-trained EM objects, which I believe I've narrowed down. When any comparison level is absent from the data being predicted, I cannot get any links at all, even when the records match perfectly. This happens regardless of the data the EM object was trained on.
Example:
library(fastLink)
library(data.table)
# Training data: two exact matches on 'foo' plus one clearly different pair.
dfA1 <- data.frame(
  foo = c('ABCD', 'AND_NOW_FOR', 'ABCDEFG'),
  bar = c(1, 2, 3)
)
dfB1 <- data.frame(
  foo = c('ABCD', 'SOMETHING_COMPLETELY_DIFFERENT', 'ABCDEFG'),
  bar = c(1, 2, 3)
)
# estimate.only = TRUE stops after the EM step and returns the EM object.
em_obj_test <- fastLink(
  dfA = dfA1,
  dfB = dfB1,
  varnames = c('foo'),
  stringdist.match = c('foo'),
  partial.match = c('foo'),
  estimate.only = TRUE
)
# Run 1: new data with the same shape as the training data
# (two exact matches plus one clearly different pair).
dfA2 <- data.frame(
  foo = c('ABCD', 'THIS_SHOULD_NOT_MATTER', 'ABCDEFG'),
  bar = c(1, 2, 3)
)
dfB2 <- data.frame(
  foo = c('ABCD', 'NEITHER_SHOULD_THIS', 'ABCDEFG'),
  bar = c(1, 2, 3)
)
results <- fastLink(
  dfA = dfA2,
  dfB = dfB2,
  varnames = c('foo'),
  stringdist.match = c('foo'),
  partial.match = c('foo'),
  em.obj = em_obj_test
)
results$matches$inds.a # Outputs 1 and 3
# Run 2: the clearly different pair is dropped; only 'ABCD' and 'ABCDEFG' remain.
dfA2 <- data.frame(
  foo = c('ABCD', 'ABCDEFG'),
  bar = c(1, 2)
)
dfB2 <- data.frame(
  foo = c('ABCD', 'ABCDEFG'),
  bar = c(1, 2)
)
results <- fastLink(
  dfA = dfA2,
  dfB = dfB2,
  varnames = c('foo'),
  stringdist.match = c('foo'),
  partial.match = c('foo'),
  em.obj = em_obj_test
)
results$matches$inds.a # No matches
# Run 3: 'ABCDEFG' is dropped; one exact match and one clearly different pair remain.
dfA2 <- data.frame(
  foo = c('ABCD', 'THIS_SHOULD_NOT_MATTER'),
  bar = c(1, 2)
)
dfB2 <- data.frame(
  foo = c('ABCD', 'NEITHER_SHOULD_THIS'),
  bar = c(1, 2)
)
results <- fastLink(
  dfA = dfA2,
  dfB = dfB2,
  varnames = c('foo'),
  stringdist.match = c('foo'),
  partial.match = c('foo'),
  em.obj = em_obj_test
)
results$matches$inds.a # No matches
In the last two runs, 'ABCD' does not match itself in the other data frame, even though it clearly should. I think this is because a dissimilar string and a partial-match-similar string must both be present in the data, in addition to the exact match.
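To check which comparison levels actually occur in each prediction dataset, something like the sketch below might help. It is not part of fastLink: `level_counts` is a hypothetical helper that uses the `stringdist` package to compute Jaro-Winkler similarity for every cross pair and bins it with thresholds of 0.94 (agree) and 0.88 (partial) and a prefix weight of 0.10, which I believe are fastLink's defaults for `cut.a`, `cut.p`, and `jw.weight`. The exact boundary handling inside fastLink may differ.

```r
library(stringdist)

# Hypothetical helper (not a fastLink function): count how many cross pairs
# of two string vectors fall into each agreement level.
# Assumed defaults: cut_a = 0.94, cut_p = 0.88, jw_weight = 0.10.
level_counts <- function(a, b, cut_a = 0.94, cut_p = 0.88, jw_weight = 0.10) {
  # stringdistmatrix() returns Jaro-Winkler distances; convert to similarity.
  sim <- 1 - stringdistmatrix(a, b, method = "jw", p = jw_weight)
  # Bin into [-Inf, cut_p), [cut_p, cut_a), [cut_a, Inf).
  lvl <- cut(
    as.vector(sim),
    breaks = c(-Inf, cut_p, cut_a, Inf),
    labels = c("disagree", "partial", "agree"),
    right = FALSE
  )
  # table() on a factor keeps levels with a zero count, which is the point here.
  table(lvl)
}
```

Calling `level_counts(dfA2$foo, dfB2$foo)` before each of the three prediction runs should show whether the runs that return no matches are the ones where some level has a count of zero, which is what my diagnosis would predict.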