Feature request: option to return list of all matches to facilitate multi-column matching
Opened this issue · 2 comments
I'm writing a function to fuzzy-match records in two datasets according to the following criteria, where a match will only be recorded as such if there is a positive match on all four columns for the same row index:
- First names (method = soundex)
- Surnames (method = soundex)
- Dates of birth (method = dl with maxDist = 1)
- Report date (must be within a certain episode window defined in days, e.g. 30 days)
Initially, for steps 1 to 3 I was using stringdist::amatch()
. However, I realised it is only bringing back the first match in each case, which might not be the correct row index that matches on all the columns, if any values are duplicated. Take the following example data:
library(data.table)
library(dtplyr)
library(dplyr)
# Create first data set (case data):
casedata <- data.table(documents_number = c("N099",
"N052",
"N047",
"N079",
"N088"),
visualId = c("C001",
"C002",
"C005",
"c004",
"c003"),
firstName = c("Sam",
"Leila",
"Jim",
"Julita",
"Jim"),
lastName = c("Tracyiek",
"Crowther",
"Maputo",
"Nlange",
"Dutto"),
dob = c("2001-11-04",
"1980-07-15",
"1995-07-14",
"2017-04-03",
"2019-08-21"),
age_years = c(23,
45,
42,
5,
19),
reportdate = c("2022-08-02",
"2022-06-16",
"2022-07-30",
"2022-08-02",
"2022-08-31"),
onsetdate = c("2022-08-01",
"2022-06-10",
"2022-07-25",
"2022-07-31",
NA))
# Convert date columns:
casedata <- casedata %>%
mutate(across(.cols = c(dob, ends_with("date")), .fns = as.Date)) %>%
as.data.table()
# Create second data set (lab results):
labdata <- data.table(documents_number = c("N099", "N052", "N047", "N123"),
cid = c("C001", "C006", "C003", "c005"),
firstName = c("Sam", "Lila", "Jim", "Ena"),
lastName = c("Tracik", "Crowther", "Dutto", "Lotuto"),
dob = c("2001-11-04", "1980-07-15", "2019-08-12", "1978-12-23"),
age_years = c(23, 45, 19, 29),
result = c("neg", "pos", "neg", "pos"),
sampledate = c("2022-08-02", "2022-08-03", "2022-08-01", "2022-08-02"))
# Convert date columns:
labdata <- labdata %>%
mutate(across(.cols = c(dob, ends_with("date")), .fns = as.Date)) %>%
as.data.table()
This gives the following two data tables to link:
# Data set 1: case data:
> casedata
documents_number visualId firstName lastName dob age_years notdate onsetdate
1: N099 C001 Sam Tracyiek 2001-11-04 23 2022-08-02 2022-08-01
2: N052 C002 Leila Crowther 1980-07-15 45 2022-06-16 2022-06-10
3: N047 C005 Jim Maputo 1995-07-14 42 2022-07-30 2022-07-25
4: N079 c004 Julita Nlange 2017-04-03 5 2022-08-02 2022-07-31
5: N088 c003 Jim Dutto 2019-08-21 19 2022-08-31 <NA>
# Data set 2: lab data:
> labdata
documents_number firstName lastName dob age_years result sampledate
1: N099 Sam Tracik 2001-11-04 23 neg 2022-08-02
2: N052 Lila Crowther 1980-07-15 45 pos 2022-08-03
3: N047 Jim Dutto 2019-08-12 19 neg 2022-08-01
4: N123 Ena Lotuto 1978-12-23 29 pos 2022-08-02
My aim is to identify which rows from casedata
match the rows in labdata
based on the above criteria and append the visualId
number from casedata
to the labdata
when a match is found, else return NA
.
As you can see, there are two cases in casedata
called Jim, however they have different last names. The case that fully matches with Jim Dutto in the labdata is row 5 in case data, but this is not the first Jim in the firstName column. When I run stringdist::amatch()
the matched index returned for Jim in labdata is the Jim in row 3 of casedata, which is not correct as row 3 doesn't match on all the other columns:
# Check first name matches by soundex:
labdata[, fn := stringdist::amatch(x = firstName,
table = casedata$firstName,
method = "soundex")]
# Check surname matches by soundex:
labdata[, sn := stringdist::amatch(x = lastName,
table = casedata$lastName,
method = "soundex")]
# Check dob matches by Damereau-Levenshtein distance (max of 1 permitted):
labdata[, bd := stringdist::amatch(x = dob,
table = casedata$dob,
method = "dl", maxDist = 1)]
# Show match results (fn = first name, sn = last name, bd = birth date):
> labdata
documents_number firstName lastName dob age_years result sampledate fn sn bd
1: N099 Sam Tracik 2001-11-04 23 neg 2022-08-02 1 1 1
2: N052 Lila Crowther 1980-07-15 45 pos 2022-08-03 2 2 2
3: N047 Jim Dutto 2019-08-12 19 neg 2022-08-01 3 5 5
4: N123 Ena Lotuto 1978-12-23 29 pos 2022-08-02 NA NA NA
If I could obtain a list of all the matches instead of just the first one, this would solve my problem as I could then determine which row indices were present in all three lists and only return a match for those.
@EmilBode has created an extension to stringdist::amatch()
which does exactly this (see here). Performing the same operation with EmilMisc::mamatch()
gives:
# Install EmilMisc package which contains mamatch function:
remotes::install_github("Dans-labs/R-package_EmilMisc")
# Check first name matches by soundex:
labdata[, fn := EmilMisc::mamatch(x = firstName,
table = casedata$firstName,
method = "soundex",
maxmatch = nrow(casedata),
returnAs = "list")]
# Check surname matches by soundex:
labdata[, sn := EmilMisc::mamatch(x = lastName,
table = casedata$lastName,
method = "soundex",
maxmatch = nrow(casedata),
returnAs = "list")]
# Check dob matches by Damereau-Levenshtein distance (max of 1 permitted):
labdata[, bd := EmilMisc::mamatch(x = dob,
table = casedata$dob,
method = "dl",
maxDist = 1,
maxmatch = nrow(casedata),
returnAs = "list")]
# Return index that matches in all three columns:
labdata[, casematch := Reduce(intersect, list(unlist(fn), unlist(sn), unlist(bd)))]
# Show results:
> labdata
documents_number firstName lastName dob age_years result sampledate fn sn bd casematch
1: N099 Sam Tracik 2001-11-04 23 neg 2022-08-02 1 1 1 1
2: N052 Lila Crowther 1980-07-15 45 pos 2022-08-03 2 2 2 2
3: N047 Jim Dutto 2019-08-12 19 neg 2022-08-01 3,5 5 5 5
4: N123 Ena Lotuto 1978-12-23 29 pos 2022-08-02 NA NA NA NA
The EmilMisc::mamatch()
function shows both matches for Jim in casedata$firstName
(rows 3 and 5) and taking the intersection of this result with results for lastName and dob gives row 5 as the correct row to match on, which is the desired result.
I can think of three possible ways to incorporate this functionality in stringdist:
- Add the mamatch function to the stringdist package as a separate function
- Modify the existing amatch function in stringdist to include an option allowing to return the first match only or a user-defined number of matches
- Add a new function to stringdist which allows matching on multiple columns with different methods, whereby the columns to match on, methods to use and relevant arguments for those methods are supplied as lists and the output is whether a match has been found using all the listed columns.
Any of these options, according to what would best suit the package purpose and structure, would be a really useful addition for record linkage tasks.
@AmyMikhail did you have a look at the reclin2
package (https://github.com/djvanderlaan/reclin2)? That package (by me) is designed to handle cases like yours.
@AmyMikhail did you have a look at the
reclin2
package (https://github.com/djvanderlaan/reclin2)? That package (by me) is designed to handle cases like yours.
Thanks @djvanderlaan - I had not come across this package before, but it does look interesting.