USGCRP/gcis-scripts

Expansion of auto-merge-people.pl

Closed this issue · 2 comments

The code at https://github.com/USGCRP/gcis-scripts/blob/master/auto-merge-people.pl is excellent, but it doesn't capture every way the same person's name can be written. We'll need to expand it to detect the following variants and flag them as possibly one person:

J. Doe
John Doe
John M. Doe
John M Doe
John M. I. Doe
John MI Doe
JC Doe
J Doe
J.C. Doe
J. C. Doe # this comes up a lot
John Middle Doe
John Middle Initial Doe
JMI Doe
JM Doe

These all come up with the NCA3.
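
One way to cover the variants listed above is to reduce each name to its given-name initials plus surname, and flag two names when the surnames agree and one set of initials is a prefix of the other. The following is only a rough Python sketch of that idea, not what auto-merge-people.pl does. The function names are made up for illustration, and it assumes "Given ... Surname" order, a single-token surname, and that an all-caps token of up to three letters (e.g. "JMI") is a run of initials.

```python
def initials_and_surname(name):
    """Split 'Given ... Surname' into (initials, surname), both lowercase.

    Assumption: the last token is the surname, and any all-caps token of
    at most three letters ('JC', 'JMI') is a run of initials.
    """
    # Put a space after every period so 'J.C.' splits into 'J.' and 'C.'
    tokens = name.replace(".", ". ").split()
    *given, surname = tokens
    initials = []
    for tok in given:
        bare = tok.rstrip(".")
        if not bare:
            continue  # stray period
        if bare.isupper() and len(bare) <= 3:
            initials.extend(bare.lower())    # 'JMI' -> j, m, i
        else:
            initials.append(bare[0].lower())  # 'John' -> j
    return "".join(initials), surname.lower()


def maybe_same_person(a, b):
    """True if a and b share a surname and have prefix-compatible initials."""
    ia, sa = initials_and_surname(a)
    ib, sb = initials_and_surname(b)
    return sa == sb and (ia.startswith(ib) or ib.startswith(ia))
```

With this rule "J. C. Doe", "J.C. Doe" and "JC Doe" all reduce to the same initials, and "J. Doe" is flagged against any of them, while "J. C. Doe" and "John M. Doe" are not flagged against each other because the middle initials conflict. Anything flagged would still need a manual check before merging.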

I'm elevating the priority on this somewhat, since it will be easier to apply the results of such a script before work on the HA begins; it will save us work in the lead-up. If nothing else, we need to add the "J. C. Doe" case, since most of the undetected cases so far have that format.

For the Indicators, I use a simple algorithm: if the "firstinitial_lastname" keys of two names match, then a manual check should be done. If the two people have different ORCID iDs, then they are distinct and the manual check can be skipped. I also make sure that the comparison is done in all lowercase with no punctuation. E.g. "John C. O'Mally" is checked as "j_omally".
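
For reference, the check described above could look roughly like this in Python. This is a sketch under assumptions, not the code actually used for the Indicators: the function names are hypothetical, a person is passed as a (name, ORCID-or-None) pair, and the last whitespace-separated token is taken as the last name.

```python
import re


def match_key(name):
    """Build a 'firstinitial_lastname' key: lowercase, punctuation removed."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())  # drops . ' - etc.
    tokens = cleaned.split()
    return f"{tokens[0][0]}_{tokens[-1]}"


def needs_manual_check(person_a, person_b):
    """Each person is a (name, orcid) pair; orcid may be None if unknown."""
    name_a, orcid_a = person_a
    name_b, orcid_b = person_b
    if match_key(name_a) != match_key(name_b):
        return False  # keys differ, not a candidate pair
    if orcid_a and orcid_b and orcid_a != orcid_b:
        return False  # different ORCID iDs: distinct people, skip the check
    return True       # same key and no conflicting identifiers


print(match_key("John C. O'Mally"))  # j_omally
```

Note that a missing ORCID iD on either side is treated here as "unknown", so the pair still goes to a manual check; only two different known identifiers skip it.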

Now addressed at:
#15

Issue #14 is now resolved, closed #14.