kosukeimai/fastLink

Using a pre-existing EM object does not work unless all comparison levels are present


zmbc commented

Note: I am still not 100% confident in my diagnosis here. The title of this issue is my best guess of the error case.

I've seen some confusing behavior with pre-trained EM objects, which I believe I've narrowed down. No links are produced (even when the records match perfectly) whenever any comparison level is absent from the data being predicted, regardless of the data that the EM object was trained on.

Example:

library(fastLink)
library(data.table)

dfA1 <- data.frame(
  foo = c('ABCD', 'AND_NOW_FOR', 'ABCDEFG'),
  bar = c(1, 2, 3)
)

dfB1 <- data.frame(
  foo = c('ABCD', 'SOMETHING_COMPLETELY_DIFFERENT', 'ABCDEFG'),
  bar = c(1, 2, 3)
)

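# Train the EM object on data that contains an exact match, a partial match,
# and a non-match for 'foo'
em_obj_test <- fastLink(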
  dfA = dfA1,
  dfB = dfB1,
  varnames = c('foo'),
  stringdist.match = c('foo'),
  partial.match = c('foo'),
  estimate.only = TRUE
)

# Run 1: all three comparison levels are present in the data being predicted
dfA2 <- data.frame(
  foo = c('ABCD', 'THIS_SHOULD_NOT_MATTER', 'ABCDEFG'),
  bar = c(1, 2, 3)
)

dfB2 <- data.frame(
  foo = c('ABCD', 'NEITHER_SHOULD_THIS', 'ABCDEFG'),
  bar = c(1, 2, 3)
)

results <- fastLink(
  dfA = dfA2,
  dfB = dfB2,
  varnames = c('foo'),
  stringdist.match = c('foo'),
  partial.match = c('foo'),
  em.obj = em_obj_test
)

results$matches$inds.a # Outputs 1 and 3

# Run 2: the dissimilar strings are removed
dfA2 <- data.frame(
  foo = c('ABCD', 'ABCDEFG'),
  bar = c(1, 2)
)

dfB2 <- data.frame(
  foo = c('ABCD', 'ABCDEFG'),
  bar = c(1, 2)
)

results <- fastLink(
  dfA = dfA2,
  dfB = dfB2,
  varnames = c('foo'),
  stringdist.match = c('foo'),
  partial.match = c('foo'),
  em.obj = em_obj_test
)

results$matches$inds.a # No matches

# Run 3: the partial-match-similar string ('ABCDEFG') is removed
dfA2 <- data.frame(
  foo = c('ABCD', 'THIS_SHOULD_NOT_MATTER'),
  bar = c(1, 2)
)

dfB2 <- data.frame(
  foo = c('ABCD', 'NEITHER_SHOULD_THIS'),
  bar = c(1, 2)
)

results <- fastLink(
  dfA = dfA2,
  dfB = dfB2,
  varnames = c('foo'),
  stringdist.match = c('foo'),
  partial.match = c('foo'),
  em.obj = em_obj_test
)

results$matches$inds.a # No matches

In the last two runs, 'ABCD' does not match itself in the other data frame, even though it clearly should. I think this is because both a dissimilar string and a partial-match-similar string must be present in the data, in addition to the exact match.
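To check which comparison levels each pair of data frames actually contains, here is a rough sketch that does not call fastLink at all. It assumes fastLink's default Jaro-Winkler settings (cut.a = 0.94, cut.p = 0.88, jw.weight = 0.10) and uses the stringdist package directly. The helper `level_counts` and its argument names are hypothetical, made up for this illustration, and this is only an approximation of how fastLink buckets string comparisons, not its actual code path.

```r
library(stringdist)

# Hypothetical helper (not part of fastLink): count how many record pairs
# fall into each comparison level for a single string field.
# Thresholds are assumed to mirror fastLink's defaults for cut.a, cut.p
# and jw.weight.
level_counts <- function(a, b, cut_agree = 0.94, cut_partial = 0.88,
                         jw_weight = 0.10) {
  # Jaro-Winkler similarity for every pair (a[i], b[j])
  sim <- 1 - stringdistmatrix(a, b, method = "jw", p = jw_weight)
  # 0 = disagree, 1 = partial (>= cut_partial), 2 = agree (>= cut_agree)
  lev <- findInterval(as.vector(sim), c(cut_partial, cut_agree))
  table(factor(lev, levels = 0:2,
               labels = c("disagree", "partial", "agree")))
}

# Run 1: the data for which em.obj produced matches
run1 <- level_counts(c("ABCD", "THIS_SHOULD_NOT_MATTER", "ABCDEFG"),
                     c("ABCD", "NEITHER_SHOULD_THIS", "ABCDEFG"))

# Run 2: dissimilar strings removed
run2 <- level_counts(c("ABCD", "ABCDEFG"),
                     c("ABCD", "ABCDEFG"))

# Run 3: partial-match-similar string removed
run3 <- level_counts(c("ABCD", "THIS_SHOULD_NOT_MATTER"),
                     c("ABCD", "NEITHER_SHOULD_THIS"))

print(run1)
print(run2)
print(run3)
```

If the diagnosis above is right, run 1 should show a nonzero count for every level, run 2 should have no "disagree" pairs, and run 3 should have no "partial" pairs. That would line up with the two runs where no matches are returned.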