question - view posterior for all pairs of rows
mbcann01 opened this issue · 4 comments
Hey guys,
Thank you for developing this package. I think it will be really useful to me. It seems like the default behavior for fastLink::fastLink
is to return the indices and posterior probabilities for matching rows only.
Is there a way to view the posterior probabilities for ALL pairs of rows?
I'm interested in knowing if there were any pairs that were NOT matched, but should be.
Thank you,
Brad
Hi Brad, glad to hear it's been helpful to you. There are a few options.
First, if you set return.all = TRUE
(the default), then fastLink()
will return all matches with a posterior match probability above 0.0001, and will it automatically dedupe all matches. In many settings, if we don't dedupe, the resulting object is very large (nearly N_a x N_b rows).
However, if you have enough memory to return the indices for all pairwise comparisons, you can simply set dedupe.matches = FALSE
and return.all = TRUE
. fastLink()
will spit out a warning that says "You have specified that all matches be returned but are not deduping the matches. The resulting object may be very large." This will return the indices and posteriors for all pairwise comparisons where the posterior match probability is greater than 0.0001, without deduping.
If you want to set the threshold even lower, you can override this in a slightly convoluted way, which we should probably make easier to do. Set return.all = FALSE
, dedupe.matches = TRUE
, and threshold.match = [some even smaller value]
.
Let me know if you have any more questions - I hope this is helpful!
Ben
Hi Ben,
Thank you very much for the response!
It's possible that I'm still doing something wrong, but I don't think your solution works in my particular situation because I'm trying to find matches within a single data set rather than two data sets. I should have been more specific from the begininning. Below I go into greater detail:
Background
In my data, each row represents a unique treatment interaction with emergency medical services (“incident”). However, some people were treated at more than once. Therefore, incident
number doesn't necessarily uniquely identify people. For some measures (e.g., demographic characteristics of the people we treated) we really want to be able to do analysis at the person level.
Initially, we attempted to uniquely identify people based on a combination of name and DOB (i.e., just using the paste
function). However, creating a unique identifier this way is complicated by data entry errors (i.e., misspelled names and mistyped dates of birth). For example, john_smith_1936_10_03 and jon_smith_1936_10_03 may be the same person, but would generate two different unique id numbers.
So, we began using the RecordLinkage package to find matches and assign unique id numbers. This worked really well for the analysis of some of our pilot data; however, we are now analyzing data from a much longer time period, and I’d really like to switch over to using fastLink
for some of the speed and scalability reasons that you mention in your publications.
Here's a small reproducible example of what I'm trying to do and the issue I'm encountering:
Simulate data
Below, we have the following people:
-
John Smith (unique person 1)
-
Jane Smith (unique person 2)
-
Joy Smith - John's wife who lives at the same address (unique person 3)
-
Amy Jones - no matches (unique person 4)
df <- tibble(
incident = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008),
nm_first = c("john", "john", "jane", "jon", "jane", "joy", "michael", "amy"),
nm_last = c(rep("smith", 7), "jones"),
sex = c("m", "m", "f", "m", "f", "f", "m", "f"),
birth_mnth = c(9, 9, 2, 9, 3, 8, 9, 1),
birth_year = c(1936, 1936, 1937, 1936, 1937, 1941, 1936, 1947),
add_num = c(101, 101, 14, 101, 14, 101, 101, 1405),
add_street = c("main", "main", "elm", "main", "elm", "main", "main", "texas")
) %>%
mutate(row = row_number()) %>%
select(row, everything()) %>%
print()
I've also included the following data entry errors/matching challenges to resolve:
-
In row 4, the "h" is missing from "john"
-
In row 5, there is a "3" for birth_mnth instead of a "2"
-
In row 7, the medic entered John Smith's middle name, "michael", in the first name column.
-
Joy and John have a lot of information in common, but we want to be able to classify them as a non-match.
#> # A tibble: 8 x 9
#> row incident nm_first nm_last sex birth_mnth birth_year add_num add_street
#> <int> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 1 1001 john smith m 9 1936 101 main
#> 2 2 1002 john smith m 9 1936 101 main
#> 3 3 1003 jane smith f 2 1937 14 elm
#> 4 4 1004 jon smith m 9 1936 101 main
#> 5 5 1005 jane smith f 3 1937 14 elm
#> 6 6 1006 joy smith f 8 1941 101 main
#> 7 7 1007 michael smith m 9 1936 101 main
#> 8 8 1008 amy jones f 1 1947 1405 texas
Reduce search space
In the real data, if we attempt to determine if every individual row in the entire data set is a match with every other individual row, it is a costly process. It becomes much more tractable if we temporarily reduce the data to one row for each combination of values that we want to investigate. In this case, we want to investigate all unique combinations of name, DOB, and address.
df_unique_combo <- df %>%
select(-row) %>%
mutate(group = paste(nm_first, nm_last, birth_year, birth_mnth, add_num, add_street, sep = "_")) %>%
group_by(group) %>%
filter(row_number() == 1) %>%
ungroup()
Incident 1002 was dropped because it didn't have any unique information.
Use fastLink
Find unique people with fastLink
fastlink_out <- fastLink::fastLink(
dfA = df_unique_combo,
dfB = df_unique_combo,
varnames = c("nm_first", "nm_last", "birth_mnth", "birth_year", "add_num", "add_street"),
stringdist.match = c("nm_first", "nm_last", "add_street"),
numeric.match = c("birth_mnth", "birth_year", "add_num"),
dedupe.matches = FALSE,
return.all = TRUE
)
As you can see in the screenshot above, even with return.all
set to TRUE
, the default behavior of fastLink
is dfA and dfB are identical, assuming deduplication of a single data set. Setting return.all to FALSE.
I believe it's due to this part of the fastLink
code:
if (identical(dfA, dfB)) {
cat("dfA and dfB are identical, assuming deduplication of a single data set.\nSetting return.all to FALSE.\n\n")
dedupe.matches <- FALSE
return.all <- FALSE
dedupe.df <- TRUE
}
Potential solution
So, one potential solution I can think of is to copy the fastLink
code into my own function and just comment out that section of code. Would this break something later that I'm not thinking of (i.e., I'm assuming there was a good reason to include that check in the first place)? Can you think of a better solution?
Also, for the sake of completeness I'll go ahead and outline the rest of the process and challenges for my particular use case – just in case it’s useful to you or others.
View matches
Next, I want to manually review matches/non-matches for false positives and false negatives. In a large data set, I would probably want to do this for some small subset in the middle range of posterior probabilities (i.e., I trust that extremely high posteriors are true matches and extremely low posteriors are true non-matches). Then, we can find the posterior cut-off to use that results in the need for the least amount of manual data cleaning.
df_dedupe <- fastLink::getMatches(
dfA = df_unique_combo,
dfB = df_unique_combo,
fl.out = fastlink_out
)
I can't get getMatches
to work, so I made my own function:
fmr_fastlink_stack_matches <- function(.x, df) {
# ===========================================================================
# Error checks
# ===========================================================================
if (!("fastLink" %in% class(.x))) {
stop('.x must be of class "fastLink"')
}
# ===========================================================================
# Create data frame of potential matches to compare
# ===========================================================================
potential_matches <- tibble::tibble(
row = .x$matches$inds.b,
matching_row = .x$matches$inds.a,
posterior_probability = .x$posterior
)
# Keep only unique combinations of rows (e.g., not 1-3 and 3-1)
potential_matches <- potential_matches %>%
dplyr::mutate(
combo = purrr::map2_chr(
.x = row,
.y = matching_row,
.f = function(x, y) {
min <- min(x, y)
max <- max(x, y)
out <- paste(min, max, sep = "_")
out
}
),
dup = duplicated(combo)
) %>%
dplyr::filter(!dup) %>%
dplyr::select(-combo, -dup)
# ===========================================================================
# Manipulate the potential matches data frame
# Stack row and matching row on top of each other
# Add a pair number to each row and matching row
# ===========================================================================
stacked_potential_matches <- tibble::tibble(
row = c(rbind(potential_matches[["row"]], potential_matches[["matching_row"]])),
pair = rep(seq(1, length(row) / 2), each = 2),
posterior_probability = rep(potential_matches[["posterior_probability"]], each = 2)
)
# ===========================================================================
# Add substantive variables of interest to matched pairs for review
# ===========================================================================
out <- stacked_potential_matches %>%
dplyr::left_join(
df %>%
dplyr::mutate(row = dplyr::row_number()),
by = "row"
)
# ===========================================================================
# Return data frame of potential matches to compare
# ===========================================================================
out
}
matches <- fmr_fastlink_stack_matches(fastlink_out, df_unique_combo)
matches
#> # A tibble: 28 x 12
#> row pair posterior_probability incident nm_first nm_last sex birth_mnth birth_year add_num add_street group
#> <dbl> <int> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 1 1 1 1001 john smith m 9 1936 101 main john_smith_1936_9_101_main
#> 2 1 1 1 1001 john smith m 9 1936 101 main john_smith_1936_9_101_main
#> 3 1 2 1 1001 john smith m 9 1936 101 main john_smith_1936_9_101_main
#> 4 3 2 1 1004 jon smith m 9 1936 101 main jon_smith_1936_9_101_main
#> 5 1 3 1 1001 john smith m 9 1936 101 main john_smith_1936_9_101_main
#> 6 5 3 1 1006 joy smith f 8 1941 101 main joy_smith_1941_8_101_main
#> 7 1 4 1 1001 john smith m 9 1936 101 main john_smith_1936_9_101_main
#> 8 6 4 1 1007 michael smith m 9 1936 101 main michael_smith_1936_9_101_main
#> 9 2 5 1 1003 jane smith f 2 1937 14 elm jane_smith_1937_2_14_elm
#> 10 2 5 1 1003 jane smith f 2 1937 14 elm jane_smith_1937_2_14_elm
#> 11 2 6 1 1003 jane smith f 2 1937 14 elm jane_smith_1937_2_14_elm
#> 12 4 6 1 1005 jane smith f 3 1937 14 elm jane_smith_1937_3_14_elm
#> 13 3 7 1 1004 jon smith m 9 1936 101 main jon_smith_1936_9_101_main
#> 14 3 7 1 1004 jon smith m 9 1936 101 main jon_smith_1936_9_101_main
#> 15 3 8 1 1004 jon smith m 9 1936 101 main jon_smith_1936_9_101_main
#> 16 5 8 1 1006 joy smith f 8 1941 101 main joy_smith_1941_8_101_main
#> 17 3 9 1 1004 jon smith m 9 1936 101 main jon_smith_1936_9_101_main
#> 18 6 9 1 1007 michael smith m 9 1936 101 main michael_smith_1936_9_101_main
#> 19 4 10 1 1005 jane smith f 3 1937 14 elm jane_smith_1937_3_14_elm
#> 20 4 10 1 1005 jane smith f 3 1937 14 elm jane_smith_1937_3_14_elm
#> 21 5 11 1 1006 joy smith f 8 1941 101 main joy_smith_1941_8_101_main
#> 22 5 11 1 1006 joy smith f 8 1941 101 main joy_smith_1941_8_101_main
#> 23 5 12 1 1006 joy smith f 8 1941 101 main joy_smith_1941_8_101_main
#> 24 6 12 1 1007 michael smith m 9 1936 101 main michael_smith_1936_9_101_main
#> 25 6 13 1 1007 michael smith m 9 1936 101 main michael_smith_1936_9_101_main
#> 26 6 13 1 1007 michael smith m 9 1936 101 main michael_smith_1936_9_101_main
#> 27 7 14 1 1008 amy jones f 1 1947 1405 texas amy_jones_1947_1_1405_texas
#> 28 7 14 1 1008 amy jones f 1 1947 1405 texas amy_jones_1947_1_1405_texas
Matches we wanted:
✅ fastLink matched john and jon
✅ fastLink matched jane and jane (with birth_mnth error)
✅ fastLink matched john, jon, and michael
Matches we don't want:
🚫 fastLink matched john and joy
🚫 fastLink matched jon and joy
🚫 fastLink matched michael and joy
- We can fix the Joy matches by blocking on gender. Below I just fix them manually for the sake of expediency.
Add person id
fmr_add_unique_id <- function(df, fastLink_obj) {
# ===========================================================================
# Create tibble of matching rows
# ===========================================================================
# Turn fastLink results into tibble of rows and their matches
matches <- tibble::tibble(
row = fastLink_obj$matches$inds.b,
matching_row = fastLink_obj$matches$inds.a
)
# Nest all matches for each row
matches <- matches %>%
dplyr::group_by(row) %>%
dplyr::mutate(matches = list(matching_row)) %>%
dplyr::ungroup()
# Reduce to unique sets of matching rows
# i.e. 1,2,3 is the same as 3,2,1
matches <- matches %>%
dplyr::mutate(matches = purrr::map_chr(
matches,
function(x) {
x = sort(x) # 1,2,3 is the same as 3,2,1
x = paste(x, collapse = ",") # Convert list to character string
x
})
) %>%
dplyr::select(matches) %>%
distinct() # Reduce to 1 row per group of matches
# Sequentially number each group of matches
# This will become the unique id
matches <- matches %>%
dplyr::mutate(
id = row_number(),
row = purrr::map( # Turn back into list
matches,
~ scan(text = ., what = 0L, sep = ",", quiet = TRUE)
)
)
# Covert to data frame with the appropriate id number for each row in the
# original data set
matches <- matches %>%
tidyr::unnest() %>%
dplyr::select(id, row)
# Join id number back to original data set
out <- matches %>%
dplyr::right_join(
df %>%
dplyr::mutate(row = dplyr::row_number()),
by = "row"
) %>%
select(-row)
# ===========================================================================
# Return original data frame with unique id added
# ===========================================================================
out
}
df_unique_combo_w_id <- fmr_add_unique_id(df_unique_combo, fastlink_out)
df_unique_combo_w_id
#> # A tibble: 7 x 10
#> id incident nm_first nm_last sex birth_mnth birth_year add_num add_street group
#> <int> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 1 1001 john smith m 9 1936 101 main john_smith_1936_9_101_main
#> 2 2 1003 jane smith f 2 1937 14 elm jane_smith_1937_2_14_elm
#> 3 1 1004 jon smith m 9 1936 101 main jon_smith_1936_9_101_main
#> 4 2 1005 jane smith f 3 1937 14 elm jane_smith_1937_3_14_elm
#> 5 1 1006 joy smith f 8 1941 101 main joy_smith_1941_8_101_main
#> 6 1 1007 michael smith m 9 1936 101 main michael_smith_1936_9_101_main
#> 7 3 1008 amy jones f 1 1947 1405 texas amy_jones_1947_1_1405_texas
✅ John, Jon, and Micheal have id == 1
✅ Jane has id == 2
✅ Amy has id == 3
🚫 Joy has id == 1
- We can fix Joy == 1 by blocking on gender. Here I just fix it manually for the sake of expediency.
df_unique_combo_w_id <- df_unique_combo_w_id %>%
mutate(id = if_else(nm_first == "joy", 4L, id))
Add id to full data
Finally, join the unique id numbers back to full data set
df_w_id <- df %>%
dplyr::left_join(
df_unique_combo_w_id %>%
dplyr::select(id, group),
by = "group") %>%
dplyr::select(id, dplyr::everything(), -group)
df_w_id
#> # A tibble: 8 x 9
#> id incident nm_first nm_last sex birth_mnth birth_year add_num add_street
#> <int> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 1 1001 john smith m 9 1936 101 main
#> 2 1 1002 john smith m 9 1936 101 main
#> 3 2 1003 jane smith f 2 1937 14 elm
#> 4 1 1004 jon smith m 9 1936 101 main
#> 5 2 1005 jane smith f 3 1937 14 elm
#> 6 4 1006 joy smith f 8 1941 101 main
#> 7 1 1007 michael smith m 9 1936 101 main
#> 8 3 1008 amy jones f 1 1947 1405 texas
And, now I have the desired results.
Sorry for the extremely long response. I just thought that you (or others) may be interested in seeing other ways that fastLink
is being used in the wild. Also, I'd still appreciate your feedback on the Potential Solution I proposed above.
Thank you!
Brad
Hi Brad,
Thanks a lot for posting such a detailed example of how to use fastLink
to find duplicates within a dataset and create a unique ID for those observations that have been duplicated. We can only hope that fastLink
becomes of the applications you are working on.
Regarding your example, I think once you block (subset) the data by gender group, then most of the problems get solved in terms of matching the right units. Subsetting the data for males would take away the discriminatory power of the variables used to merge as first name would carry all the weight (it is the only one that varies).
Previously, we have addressed a similar question, please see issue #29. I think the data used in #29 would be a better test of your code -- for which I have little to add as it fits the problem at hand quite well.
Looking forward to hearing how fastLink
fares in your applications and other suggestions you may have.
If anything, do not hesitate to let us know.
All my best,
Ted