kosukeimai/fastLink

getMatches() warning / deduplication

felixhaass opened this issue · 2 comments

Hi,

thanks for the great package & the terrific documentation. It's extremely useful and I think many people are going to use it--I definitely will!

I'm having issues when deduplicating a dataset in which I have only a few variables to match on. The goal is to generate a common ID for observations with very similar string identifiers. When I follow the deduplication procedure sketched in the README, but drop all numeric variables and keep only a handful of the string variables, getMatches() produces a warning and the common IDs end up assigned to the wrong rows.

Reproducible Example

Here's a reproducible example to illustrate the problem:

library(tidyverse)
library(fastLink)

## Add duplicates
set.seed(123)

dfA <- fastLink::dfA
dfA <- rbind(dfA, dfA[sample(1:nrow(dfA), 10, replace = FALSE),])

## Run fastLink
fl_out_dedupe <- fastLink(
  dfA = dfA, dfB = dfA,
  varnames = c("firstname", "lastname", "city")
)

## Run getMatches
dfA_dedupe <- getMatches(dfA = dfA, dfB = dfA, fl.out = fl_out_dedupe)

## Look at the IDs of the duplicates
names(table(dfA_dedupe$dedupe.ids)[table(dfA_dedupe$dedupe.ids) > 1])

## Show duplicated observation
dfA_dedupe[dfA_dedupe$dedupe.ids == 4,]

This is essentially the code from the "deduplication" section of the README, except that I've reduced the matching variables to just three: firstname, lastname, and city.

Problem description

Running

dfA_dedupe <- getMatches(dfA = dfA, dfB = dfA, fl.out = fl_out_dedupe)

however, results in the following message:

Warning message:
In dfA$dedupe.ids[dfA$dedupe.ids %in% id.original] <- id.duplicated :
  number of items to replace is not a multiple of replacement length

When we look at the resulting data frame, it's clear that the matched IDs are somehow wrongly assigned:

> dfA_dedupe[dfA_dedupe$dedupe.ids == 4,]
   firstname middlename lastname housenum   streetname          city birthyear dedupe.ids
4     joseph      clyde  mcnulty    30436     49th ave Castro Valley      1961          4
38     david       <NA>  johnson     5300  kilkenny pl       Oakland      1960          4

Clearly, joseph and david don't match on any of the chosen variables.
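
The warning itself is ordinary base R recycling behaviour, which may explain how IDs end up on the wrong rows. Here is a minimal sketch that is independent of fastLink; the vector and replacement values are made up for illustration and are not taken from the getMatches() source:

```r
# Toy ID vector (hypothetical values): the ID 2 appears twice
ids <- c(1, 2, 2, 3)

# Three positions match (the two 2s and the 3), but only two replacement
# values are supplied. R recycles the replacement vector and warns:
# "number of items to replace is not a multiple of replacement length"
ids[ids %in% c(2, 3)] <- c(10, 20)

ids
# → 1 10 20 10
```

The two rows that originally shared ID 2 now carry different IDs, and the row that had ID 3 shares an ID with one of them. If something similar happens inside getMatches(), it would produce exactly the kind of mismatched groups shown above.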

User-written function works

Interestingly, this problem seems connected to #36. In #36, @mbcann01 provides a user-written function to extract matched pairs from the fastLink-object.

Specifically, if we run the fmr_add_unique_id() function provided in #36 and follow the procedure described there, we can retrieve the correct IDs.

# run the code from #36 that generates the "fmr_add_unique_id()" function first

# generate group ID
dfA <- unite(dfA, "group", firstname:birthyear, remove = FALSE)

# extract matches; 'fl_out_dedupe' comes from the code chunk above
dfA_dedupe_user <- fmr_add_unique_id(dfA, fl_out_dedupe)

# join with original data
dfA_w_id <- dfA %>% 
 dplyr::left_join(
   dfA_dedupe_user %>% 
     dplyr::select(id, group), 
   by = "group") %>% 
 dplyr::select(id, dplyr::everything(), -group)

dfA_w_id[dfA_w_id$id == 3, ]

gives us

> dfA_w_id[dfA_w_id$id == 3, ]
  id firstname middlename lastname housenum    streetname          city birthyear
4  3    joseph      aaron   joseph     4547  piedmont ave Castro Valley      1948
5  3    joseph      clyde  mcnulty    30436      49th ave Castro Valley      1961

where id indicates the common ID for duplicated matches (similar to dedupe.ids). Here the correct (i.e. the most similar) josephs are matched. (I know they're not the "correct" match, but the goal was to find the most similar ones.)

Summary

Since the user-written function from #36 correctly retrieves the IDs for the most similar matches, the problem lies somehow in the construction of the deduplicated data.frame from getMatches(), and not in the matching process itself.
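
In case it helps with testing a fix, here is a rough sanity check for an ID column. It flags every ID whose rows do not agree on a single matching variable. This is only a heuristic sketch on a toy data frame: the rows are loosely adapted from the output above, and the ID 7 and the names `toy`, `match_vars`, `shared`, and `suspicious` are made up for illustration.

```r
# Toy data (hypothetical): ID 4 groups two unrelated rows, ID 7 groups two similar ones
toy <- data.frame(
  firstname  = c("joseph", "david", "joseph", "joseph"),
  lastname   = c("mcnulty", "johnson", "joseph", "mcnulty"),
  city       = c("Castro Valley", "Oakland", "Castro Valley", "Castro Valley"),
  dedupe.ids = c(4, 4, 7, 7),
  stringsAsFactors = FALSE
)
match_vars <- c("firstname", "lastname", "city")

# For each ID, count how many matching variables take a single value
# across all rows that share that ID
shared <- sapply(split(toy[match_vars], toy$dedupe.ids), function(g) {
  sum(vapply(g, function(col) length(unique(col)) == 1, logical(1)))
})

# IDs whose rows agree on none of the matching variables look suspicious
suspicious <- names(shared)[shared == 0]
suspicious
# → "4"
```

Exact agreement is a cruder criterion than the string-distance comparisons fastLink uses, so an ID flagged here is only a candidate for closer inspection, not proof of a wrong assignment.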

Let me know if I can provide any additional info / code to help you fix this issue--if there is indeed an issue and I'm not doing something wrong here.

Thanks again for your work!

Best
Felix

@felixhaass Thanks for this, and sorry about the delay in responding! We're working on this in a separate branch (referenced above) and should push a fix to GitHub in the next few days. The fix will be in the next CRAN release as well. We really appreciate the catch.

Great, thanks for the reply & fixing the issue. Looking forward to the release!