kosukeimai/fastLink

Looking for a way to feed threshold cutoffs to individual variables

ajw5296 opened this issue · 5 comments

Is there a way to set different cutoff values for certain variables? For instance, if the DOB similarity for a potential match isn't above .9, the pair wouldn't be considered a match, while all other variables have a cutoff of .8.

@ajw5296 if you are using the fastLink wrapper function, it is not possible (those cutpoints are global).

If you need anything else, let us know.

All my best,

Ted

@ajw5296 can you provide an example of what you have in mind here? Is your question about how we compare variables, or about the weight each variable receives when predicting the probability that two records are the same?

Looking forward to hearing from you!

Ted

Hey @tedenamorado, my question is more about cutoffs, and whether they can be set at the variable level. More precisely:

  1. Are individual matching probabilities calculated within the fastLink method?
    So a match might be something like .98 (fname), .98 (lname), .83 (dob), and these are then combined with their weights for the final overall posterior.

  2. Can we set threshold cutoffs for those individual variables, either in the method or through other methods? So despite fname and lname having a high probability, we would eliminate the potential match since the dob is below .9 (the respective cutoff).

I suppose this is kind of a question about weights in a way, but I think setting a higher weight for dob is methodologically different from setting a cutoff for dob. But if setting parameters for weights is easier, I'm interested in looking into it.

And just as a note, we looked into the stringSubset method, but since DOBs are shared values, it didn't really help us much.

Let me know if I can provide more info, thanks for your help!

I do not think it is possible in fastLink other than maybe to create ad hoc linkage variables and then work directly with the corresponding gammas. A similar open issue is #49.
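To make the kind of deterministic rule being asked about concrete, here is a minimal sketch of a per-field cutoff applied to candidate pairs after linkage. It is not a fastLink feature and does not use fastLink's API: it is plain Python for illustration, the names `FIELD_CUTOFFS`, `similarity`, `passes_field_cutoffs`, and `filter_pairs` are hypothetical, and `difflib`'s ratio is only a stand-in for whatever string distance the linkage actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical per-field cutoffs: DOB is held to a stricter bar than names.
FIELD_CUTOFFS = {"fname": 0.8, "lname": 0.8, "dob": 0.9}


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; swap in the string distance you actually use."""
    return SequenceMatcher(None, a, b).ratio()


def passes_field_cutoffs(rec_a: dict, rec_b: dict, cutoffs: dict = FIELD_CUTOFFS) -> bool:
    """Return True only if every listed field meets its own cutoff."""
    return all(similarity(rec_a[f], rec_b[f]) >= c for f, c in cutoffs.items())


def filter_pairs(pairs, cutoffs: dict = FIELD_CUTOFFS):
    """Keep only the candidate pairs that satisfy all per-field cutoffs."""
    return [(a, b) for a, b in pairs if passes_field_cutoffs(a, b, cutoffs)]


# Made-up records: same name, one pair with a clearly different DOB.
a = {"fname": "John", "lname": "Smith", "dob": "1980-01-01"}
b = {"fname": "John", "lname": "Smith", "dob": "1980-01-01"}
c = {"fname": "John", "lname": "Smith", "dob": "1975-06-23"}

kept = filter_pairs([(a, b), (a, c)])
```

Here the pair `(a, c)` is dropped on DOB alone even though both names agree exactly, which is the behaviour described in the question. Applying such a rule on top of a probabilistic match is a modelling choice of its own and changes what the resulting match set means.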

The Python-based splink has a similar open issue, moj-analytical-services/splink#434. The proprietary Match*Pro has a "Classification Tab" with a user-friendly GUI for creating similar deterministic criteria.

For what it is worth, to me this seems of little use compared with other promised features under development such as probabilistic blocking and active learning.

Hi @ajw5296,

As @aalexandersson mentions, it is not possible to set deterministic rules based on the probability of observing a specific agreement value for field k given that a pair of records is a match. The model learns these probabilities from the data.

Our focus is on the probability that a pair of records is a match given the agreement pattern and the parameters of the model, which is a composite measure of the field-specific probabilities of observing an agreement value given that a pair of records is a match.
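As a rough illustration of how field-specific probabilities combine into one match probability, here is a minimal sketch of a Fellegi-Sunter style posterior, assuming conditional independence across fields. It is plain Python rather than fastLink code, the function name `match_posterior` is hypothetical, and all parameter values are made up for the example rather than estimated from data.

```python
from math import prod


def match_posterior(pattern: dict, m_probs: dict, u_probs: dict, prior_match: float) -> float:
    """P(match | agreement pattern), assuming fields are conditionally independent.

    pattern:     agreement level observed for each field, e.g. {"dob": "agree"}
    m_probs:     P(agreement level | match) per field
    u_probs:     P(agreement level | non-match) per field
    prior_match: overall share of pairs that are matches
    """
    m_like = prod(m_probs[f][level] for f, level in pattern.items())
    u_like = prod(u_probs[f][level] for f, level in pattern.items())
    numerator = prior_match * m_like
    return numerator / (numerator + (1.0 - prior_match) * u_like)


# Made-up parameters, identical for every field to keep the arithmetic simple.
fields = ["fname", "lname", "dob"]
m = {f: {"agree": 0.9, "disagree": 0.1} for f in fields}
u = {f: {"agree": 0.1, "disagree": 0.9} for f in fields}

all_agree = {"fname": "agree", "lname": "agree", "dob": "agree"}
dob_differs = {"fname": "agree", "lname": "agree", "dob": "disagree"}

p_all = match_posterior(all_agree, m, u, prior_match=0.01)
p_dob = match_posterior(dob_differs, m, u, prior_match=0.01)
```

With these invented numbers, a disagreement on DOB pulls the posterior down sharply compared with full agreement, but through the model's parameters rather than through a hard per-field rule.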

However, an alternative would be to pass your own set of parameters to fastLink. For example, we discuss how to pass parameters from a random sample of observations to a larger dataset here.

Please, if you feel we can be of further assistance, let us know.

All my best,

Ted