markvanderloo/stringdist

Feature request: option to return list of all matches to facilitate multi-column matching

Opened this issue · 2 comments

I'm writing a function to fuzzy-match records in two datasets according to the following criteria, where a match will only be recorded as such if there is a positive match on all four columns for the same row index:

  1. First names (method = soundex)
  2. Surnames (method = soundex)
  3. Dates of birth (method = dl with maxDist = 1)
  4. Report date (must be within a certain episode window defined in days, e.g. 30 days)

Initially, for steps 1 to 3 I was using stringdist::amatch(). However, I realised it is only bringing back the first match in each case, which might not be the correct row index that matches on all the columns, if any values are duplicated. Take the following example data:

library(data.table)
library(dtplyr)
library(dplyr)

# Create first data set (case data):
casedata <- data.table(documents_number = c("N099", 
                                            "N052", 
                                            "N047", 
                                            "N079", 
                                            "N088"), 
                      visualId = c("C001", 
                                   "C002", 
                                   "C005", 
                                   "c004",
                                   "c003"), 
                      firstName = c("Sam", 
                                    "Leila", 
                                    "Jim", 
                                    "Julita", 
                                    "Jim"), 
                      lastName = c("Tracyiek", 
                                   "Crowther", 
                                   "Maputo", 
                                   "Nlange", 
                                   "Dutto"), 
                      dob = c("2001-11-04", 
                              "1980-07-15", 
                              "1995-07-14", 
                              "2017-04-03",
                              "2019-08-21"),
                      age_years = c(23, 
                                    45, 
                                    42, 
                                    5,
                                    19),
                      reportdate = c("2022-08-02", 
                                  "2022-06-16", 
                                  "2022-07-30", 
                                  "2022-08-02", 
                                  "2022-08-31"),
                      onsetdate = c("2022-08-01", 
                                    "2022-06-10", 
                                    "2022-07-25", 
                                    "2022-07-31", 
                                    NA))

# Convert date columns:
casedata <- casedata %>% 
  mutate(across(.cols = c(dob, ends_with("date")), .fns = as.Date)) %>% 
  as.data.table()

# Create second data set (lab results):
labdata <- data.table(documents_number = c("N099", "N052", "N047", "N123"), 
                      cid = c("C001", "C006", "C003", "c005"), 
                      firstName = c("Sam", "Lila", "Jim", "Ena"), 
                      lastName = c("Tracik", "Crowther", "Dutto", "Lotuto"), 
                      dob = c("2001-11-04", "1980-07-15", "2019-08-12", "1978-12-23"),
                      age_years = c(23, 45, 19, 29),
                      result = c("neg", "pos", "neg", "pos"), 
                      sampledate = c("2022-08-02", "2022-08-03", "2022-08-01", "2022-08-02"))

# Convert date columns:
labdata <- labdata %>% 
  mutate(across(.cols = c(dob, ends_with("date")), .fns = as.Date)) %>% 
  as.data.table()

This gives the following two data tables to link:

# Data set 1: case data:
> casedata
   documents_number visualId firstName lastName        dob age_years    notdate  onsetdate
1:             N099     C001       Sam Tracyiek 2001-11-04        23 2022-08-02 2022-08-01
2:             N052     C002     Leila Crowther 1980-07-15        45 2022-06-16 2022-06-10
3:             N047     C005       Jim   Maputo 1995-07-14        42 2022-07-30 2022-07-25
4:             N079     c004    Julita   Nlange 2017-04-03         5 2022-08-02 2022-07-31
5:             N088     c003       Jim    Dutto 2019-08-21        19 2022-08-31       <NA>

# Data set 2: lab data:
> labdata
   documents_number firstName lastName        dob age_years result sampledate
1:             N099       Sam   Tracik 2001-11-04        23    neg 2022-08-02
2:             N052      Lila Crowther 1980-07-15        45    pos 2022-08-03
3:             N047       Jim    Dutto 2019-08-12        19    neg 2022-08-01
4:             N123       Ena   Lotuto 1978-12-23        29    pos 2022-08-02

My aim is to identify which rows from casedata match the rows in labdata based on the above criteria and append the visualId number from casedata to the labdata when a match is found, else return NA.

As you can see, there are two cases in casedata called Jim, however they have different last names. The case that fully matches with Jim Dutto in the labdata is row 5 in case data, but this is not the first Jim in the firstName column. When I run stringdist::amatch() the matched index returned for Jim in labdata is the Jim in row 3 of casedata, which is not correct as row 3 doesn't match on all the other columns:

# Check first name matches by soundex:
labdata[, fn := stringdist::amatch(x = firstName, 
                                       table = casedata$firstName, 
                                       method = "soundex")]

# Check surname matches by soundex:
labdata[, sn := stringdist::amatch(x = lastName, 
                                       table = casedata$lastName, 
                                       method = "soundex")]

# Check dob matches by Damereau-Levenshtein distance (max of 1 permitted):
labdata[, bd := stringdist::amatch(x = dob, 
                                   table = casedata$dob, 
                                   method = "dl", maxDist = 1)]

# Show match results (fn = first name, sn = last name, bd = birth date):
> labdata
   documents_number firstName lastName        dob age_years result sampledate fn sn bd
1:             N099       Sam   Tracik 2001-11-04        23    neg 2022-08-02  1  1  1
2:             N052      Lila Crowther 1980-07-15        45    pos 2022-08-03  2  2  2
3:             N047       Jim    Dutto 2019-08-12        19    neg 2022-08-01  3  5  5
4:             N123       Ena   Lotuto 1978-12-23        29    pos 2022-08-02 NA NA NA

If I could obtain a list of all the matches instead of just the first one, this would solve my problem as I could then determine which row indices were present in all three lists and only return a match for those.

@EmilBode has created an extension to stringdist::amatch() which does exactly this (see here). Performing the same operation with EmilMisc::mamatch() gives:

# Install EmilMisc package which contains mamatch function:
remotes::install_github("Dans-labs/R-package_EmilMisc")

# Check first name matches by soundex:
labdata[, fn := EmilMisc::mamatch(x = firstName, 
                                                       table = casedata$firstName, 
                                                       method = "soundex", 
                                                       maxmatch = nrow(casedata), 
                                                       returnAs = "list")]

# Check surname matches by soundex:
labdata[, sn := EmilMisc::mamatch(x = lastName, 
                                                        table = casedata$lastName, 
                                                        method = "soundex", 
                                                        maxmatch = nrow(casedata), 
                                                        returnAs = "list")]

# Check dob matches by Damereau-Levenshtein distance (max of 1 permitted):
labdata[, bd := EmilMisc::mamatch(x = dob, 
                                                        table = casedata$dob, 
                                                        method = "dl", 
                                                        maxDist = 1, 
                                                        maxmatch = nrow(casedata), 
                                                        returnAs = "list")]

# Return index that matches in all three columns:
labdata[, casematch := Reduce(intersect, list(unlist(fn), unlist(sn), unlist(bd)))]

# Show results:
> labdata
   documents_number firstName lastName        dob age_years result sampledate  fn sn bd casematch
1:             N099       Sam   Tracik 2001-11-04        23    neg 2022-08-02   1  1  1         1
2:             N052      Lila Crowther 1980-07-15        45    pos 2022-08-03   2  2  2         2
3:             N047       Jim    Dutto 2019-08-12        19    neg 2022-08-01 3,5  5  5         5
4:             N123       Ena   Lotuto 1978-12-23        29    pos 2022-08-02  NA NA NA        NA

The EmilMisc::mamatch() function shows both matches for Jim in casedata$firstName (rows 3 and 5) and taking the intersection of this result with results for lastName and dob gives row 5 as the correct row to match on, which is the desired result.

I can think of three possible ways to incorporate this functionality in stringdist:

  1. Add the mamatch function to the stringdist package as a separate function
  2. Modify the existing amatch function in stringdist to include an option allowing to return the first match only or a user-defined number of matches
  3. Add a new function to stringdist which allows matching on multiple columns with different methods, whereby the columns to match on, methods to use and relevant arguments for those methods are supplied as lists and the output is whether a match has been found using all the listed columns.

Any of these options, according to what would best suit the package purpose and structure, would be a really useful addition for record linkage tasks.

@AmyMikhail did you have a look at the reclin2 package (https://github.com/djvanderlaan/reclin2)? That package (by me) is designed to handle cases like yours.

@AmyMikhail did you have a look at the reclin2 package (https://github.com/djvanderlaan/reclin2)? That package (by me) is designed to handle cases like yours.

Thanks @djvanderlaan - I had not come across this package before, but it does look interesting.