kosukeimai/fastLink

Q: Database size limit for duplicate removal?

gbdias opened this issue · 15 comments

Hi,

I am trying to perform deduplication on a database with 1.8M records. The analysis has been running for ~10 days on an 8-core machine with 32 GB of RAM. Do you believe this task can be achieved on such a machine, or do I need a bigger server?

My command is as follows:

fl_pac_dedup <- fastLink(
    dfA = pacientes_clean, dfB = pacientes_clean,
    varnames = c("pacient_name", "mother_name", "birth_date"),
    stringdist.match = c("pacient_name", "mother_name", "birth_date"),
    dedupe.matches = FALSE, return.all = FALSE
)

Best,
Gui

Hi,

I do not think it is a problem with the specs of your machine, but rather with trying to conduct the merge without any blocking. From your code, I see that you only have three fields to match, and the first two are names.

My first recommendation would be to use the birth_date to block. For example, you can create subsets of the data by year of birth and conduct the merge within each subset. Note that if you have many missing values on that field, then this suggestion may not be ideal.
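In code, a minimal sketch of that kind of blocking could look like the following (assuming birth_date parses with as.Date(); the birth_year and blocks names are just illustrative):

pacientes_clean$birth_year <- format(as.Date(pacientes_clean$birth_date), "%Y")

# one subset per birth year; the merge is then run within each block
blocks <- split(pacientes_clean, pacientes_clean$birth_year)
sapply(blocks, nrow)  # check the block sizes before running anything heavy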

Quick question: have the names been separated into components? For example, first name, middle name, and last name?

Keep us posted!

Ted

Thanks for the quick reply @tedenamorado !

  • I have not separated the names by component; do you think that would be a better way to go about this task?

  • I was initially unsure about blocking since there might be typos in the birth_date field, but if that is what it takes to get the job to complete, I'll give it a try. :)

Gui

Hi Gui,

Yes, I think parsing may help. Imagine that one file has the name Ted Enamorado and the other has the name Ted Enamorado Enamorado. By glancing at both names you know they refer to the same person, but a string similarity measure like Jaro-Winkler scores them at 0.85 on a scale from 0 (completely different) to 1 (identical). The default string-similarity threshold in fastLink is set higher than that, so the two names would be considered different.

When working with dates, I usually transform them into the number of days to a specific date in the future and then divide that by 365.25, which puts the date on a yearly scale. For example,

as.numeric((as.Date("2022-10-10") - as.Date("1997-09-25"))/365.25)

Such a transformation allows you to incorporate all the components of a date into a single number, and you can then use the numeric.match option of fastLink to compare the resulting numbers. Thus, my hope is that subsetting the data by year and then comparing this new numeric variable within each subset could work well in your setting.
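As a rough illustration, the transformation and the numeric.match option could be combined along these lines (the reference date, the birth_num name, and block_df for one year-of-birth subset are all placeholders, not part of the package):

ref_date <- as.Date("2030-01-01")  # any fixed date in the future works
pacientes_clean$birth_num <- as.numeric(ref_date - as.Date(pacientes_clean$birth_date)) / 365.25

fl_block <- fastLink(
    dfA = block_df, dfB = block_df,  # block_df = one year-of-birth subset
    varnames = c("pacient_name", "mother_name", "birth_num"),
    stringdist.match = c("pacient_name", "mother_name"),
    numeric.match = "birth_num"
)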

All my best,

Ted

Hi Ted,

I'm giving your strategy a try and had a follow up question.

Blocking by year of birth results in a large number of blocks. How should I go about running deduplication on each block and merging results back into a single, deduplicated dataframe?

I know this is a general R question and I can probably come up with a method, but maybe you already have an efficient routine developed? I found that this post #63 (comment) may do what I need, after replacing the general fastLink call with a deduplication call. Do you agree?

Thanks,
Gui

Hi Gui,

If you are running this on just one computer, the approach you mention would work. For deduplication, I wrote a short primer on how to do it; you can access it here: https://www.dropbox.com/s/jhob3yz0594bo34/Enamorado_Census.pdf?raw=1
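For the single-machine case, a serial loop over the blocks might look roughly like this (a sketch only, assuming the blocks list comes from split() by birth year and that the matched row indices are available in fl_out$matches$inds.a and fl_out$matches$inds.b):

results <- vector("list", length(blocks))

for (i in seq_along(blocks)) {
    block_df <- blocks[[i]]
    fl_out <- fastLink(
        dfA = block_df, dfB = block_df,
        varnames = c("pacient_name", "mother_name", "birth_num"),
        stringdist.match = c("pacient_name", "mother_name"),
        numeric.match = "birth_num"
    )
    # matched pairs within this block (row indices are relative to block_df)
    pairs <- data.frame(
        block = names(blocks)[i],
        ind_a = fl_out$matches$inds.a,
        ind_b = fl_out$matches$inds.b
    )
    # drop trivial self-matches (each record matches itself when dfA = dfB)
    results[[i]] <- subset(pairs, ind_a != ind_b)
}

all_matches <- do.call(rbind, results)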

If you have access to a computer cluster, then each block becomes a job and you can distribute the blocks.

Hope this helps!

Ted

Hi Ted,

Since I do not have access to a cluster for this job, I am using a simple loop to process each block serially.

Runtime seems within reason, but I am running into memory limitations. I initially used 6 CPUs; now I am down to 2 to see if the memory bottleneck goes away. Any tips on that would be appreciated.

Best,
Gui

Hi Gui,

If you are running into memory limitations, it can be one of two things:

  1. Some of the blocks are really large.
  2. There is a lot of overhead in terms of memory.

If the problem is 2, then you can solve it by saving the matched datasets to disk at each iteration of the loop, removing those objects after saving with the rm() function, and then running garbage collection with gc() to free up some RAM.

If using the tricks above does not help, I think the next step is to check which blocks are too large and subset those one step further (if possible).
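Concretely, the inside of the loop could be adjusted along these lines (file names are illustrative; the fastLink call is the same as in the earlier sketch):

for (i in seq_along(blocks)) {
    fl_out <- fastLink(
        dfA = blocks[[i]], dfB = blocks[[i]],
        varnames = c("pacient_name", "mother_name", "birth_num"),
        stringdist.match = c("pacient_name", "mother_name"),
        numeric.match = "birth_num"
    )

    # write the block-level result to disk instead of keeping it in RAM
    saveRDS(fl_out, file = paste0("fl_block_", names(blocks)[i], ".rds"))

    # drop the large object and trigger garbage collection
    rm(fl_out)
    gc()
}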

Keep us posted!

Ted

Hi Gui,

How many blocks do you have? Is the problem always for the same block or does it vary?

Do you use Windows? If so, I also recommend a restart (not a shutdown) to free up all RAM before the deduplication. It is typically not necessary, but I have sometimes seen memory-related problems disappear after a restart.

Anders

Hi @tedenamorado and @aalexandersson ,

Thank you very much for the advice.

  • I have ~120 blocks in this deduplication dataset.
  • The largest blocks have ~50,000 records.
  • I have not monitored if the same block is causing the memory limitation.
  • I have access to both a Windows machine (i7-6700, 32 GB RAM) and a macOS machine (Apple M1, 16 GB RAM).

I will implement @tedenamorado 's cleanup strategy with rm() and gc() and see if that solves it. I will also restart the system just to make sure I don't have too many processes taking up RAM.

Thanks again \o/
Gui

That is relatively many blocks compared with what I usually use. Therefore, I am curious: could the issue be the opposite? That is, is the number of observations in the smallest block smaller than the number of blocks? That would have generated an error for a record linkage (of two datasets), but I am not sure how fastLink handles this issue for deduplication.

What is the runtime? Do you get an error message? How many observations does the smallest block have?

Dear @aalexandersson ,

  • Indeed, there are blocks where the number of observations is smaller than the total number of blocks (e.g., n = 1). Here is the histogram of block lengths (in number of records).

[Screenshot: histogram of block lengths]

> summary(block_length)
         Min.       1st Qu.        Median          Mean       3rd Qu.          Max. 
    1.0000000   391.5000000  8204.0000000 14745.1300813 31667.5000000 43764.0000000 
  • I am blocking by year of birth, so there might actually be some years where only a single individual from this database was born.


  • The runtime for my previous tries was several days (maybe 10) before the machine started freezing up and I stopped it.
  • Do you think that removing the blocks that contain too few observations could help?

Yes, I think it will help to remove the blocks that contain fewer observations than the number of blocks.

For record linkage, when using this code for a block with fewer observations than the number of blocks,

for (i in 1:length(block_out)){

I get this error message:

Error in { : task 1 failed - "error in evaluating the argument 'x' in selecting a method for function 'which': subscript out of bounds"

I suspect that fastLink has a similar problem when used for deduplication, with the additional issue of not providing an error message. Please remove the blocks with fewer observations than the number of blocks, then try again and report back whether it solves the issue of R freezing.
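One way to drop those small blocks before looping could be sketched as follows (the cutoff simply follows the suggestion above and can be adjusted):

block_sizes <- sapply(blocks, nrow)
min_size <- length(blocks)   # "fewer observations than the number of blocks"
blocks_keep <- blocks[block_sizes >= min_size]

# the records in the dropped blocks can be set aside and handled separately,
# e.g. pooled into one extra block
blocks_small <- blocks[block_sizes < min_size]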

Also, blocking with enough observations will not by itself solve any scalability issue. Therefore, you may want to first work with a much smaller deduplication dataset, for example 18,000 records (1% of your 1.8M), to test that your fastLink code with an R for loop for blocking runs and produces a correct result.

Then, once you trust your code, you can scale up to using the full dataset.

  • I tested my loop on a small, 38k-record dataset and it works quickly and correctly.
  • Linking the 1.8M and the 38k databases also runs relatively quickly using my loop.
  • What I really need to tune is the memory issue for the 1.8M deduplication step. Once I have time to implement your suggestions, I will summarize the outcomes here. Thanks a lot!

Hi Gui,

Another idea that could work: for those records with a birth year before 1940, pack them all into a single block.
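In code, that pooling could be sketched as follows (reusing the birth_year column from the earlier blocking sketch; the block_key name is just an illustration):

pacientes_clean$block_key <- ifelse(
    as.numeric(pacientes_clean$birth_year) < 1940,
    "pre_1940",
    pacientes_clean$birth_year
)
blocks <- split(pacientes_clean, pacientes_clean$block_key)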

Keep us posted on how it goes!

Ted