Name cleanup for display vs. name tokenization
paultyng opened this issue · 2 comments
This library is great for name display, but would it be worth pursuing something solely for computer-based name recognition? For example: tokenize the name, sort the tokens alphabetically, and use the result as a key, so that (too naive an algorithm, but just an example):
- John Smith
- Smith John
- Smith, John Sr.
- etc.
Could all potentially be recognized as similar entities by the computer, even if the display of the 2nd one may not come through properly since it's missing a comma?
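To make the idea concrete, here is a minimal sketch of the tokenize-and-sort key in plain Python. It does not use Name Cleaver at all; the function name `name_key` and the small `IGNORED` set of honorifics and suffixes are hypothetical and only there for illustration.

```python
import re

# Hypothetical, deliberately tiny list of tokens to drop before sorting.
IGNORED = {"mr", "mrs", "ms", "dr", "sr", "jr", "ii", "iii", "iv"}

def name_key(name: str) -> str:
    """Return an order-insensitive key for a personal name.

    Lowercases the name, splits it into runs of letters, drops
    honorifics/suffixes, and sorts the remaining tokens alphabetically.
    """
    tokens = re.findall(r"[^\W\d_]+", name.casefold())
    return " ".join(sorted(t for t in tokens if t not in IGNORED))

for raw in ["John Smith", "Smith John", "Smith, John Sr."]:
    print(raw, "->", name_key(raw))
```

All three sample names produce the same key. The key is only meant for comparison, never for display, and it will also merge names that merely share tokens in a different order.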
I had a lot of similar problems parsing data for Wikipedia automation and this was one way I tackled the name cleanup, separating the computer recognition of the same person from the display cleanup.
Thus far, I have mostly attempted to keep any algorithms specific to matching tasks (vs display tasks) out of Name Cleaver, but any auxiliary methods would be welcome. Name Cleaver already performs tokenization, and in fact, it has a method, primary_name_parts(), which returns a list of first and last names for human names, so you'd be just a sort() away from the particular process you describe. We could certainly work on documentation to advertise this fact.
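As a rough sketch of the "just a sort() away" step: the snippet below does not import Name Cleaver. It assumes you already have, for each raw name, the list of first and last names that primary_name_parts() is described as returning; the lists in `sample` are written by hand, and the helper names are hypothetical.

```python
from collections import defaultdict

def order_insensitive_key(parts):
    """Build a matching key from a list of name parts (e.g. first and last name)."""
    return tuple(sorted(p.casefold() for p in parts))

def group_by_key(parsed):
    """Group raw names whose parts produce the same key.

    `parsed` maps each raw name to its list of primary name parts.
    """
    groups = defaultdict(list)
    for raw, parts in parsed.items():
        groups[order_insensitive_key(parts)].append(raw)
    return dict(groups)

# Hand-written stand-ins for parsed output; not produced by the library here.
sample = {
    "John Smith": ["John", "Smith"],
    "Smith, John Sr.": ["John", "Smith"],
    "Jane Doe": ["Jane", "Doe"],
}
groups = group_by_key(sample)
print(groups)
```

Keeping the key separate from the display form means the cleaned-up display name can still be stored alongside it.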
You might be interested in a name matching mini-framework which is built on top of Name Cleaver, and resides in our datacommons code base. It has been on my someday-do list to break it out into its own library (possibly as a Django app, since it currently relies on Django's ORM), so it could be better shared with the community. The meat of it is in the base classes, and instances of our past use cases for it reside in the commands directory.
https://github.com/sunlightlabs/datacommons/tree/master/dcentity/matching
Of course, these sorts of heuristic approaches are never going to be quite as smart as a system which uses statistics to assess the probability that two names refer to the same individual. We also try harder to avoid false positives than false negatives, so currently none of our matching scripts would take care of your second example, where "Smith John" == "John Smith". Even a human, absent more context, could not know whether "Henry Paul" should be "Paul Henry". But Name Cleaver could certainly help you make such a match if you wish.
Yeah, as I mentioned, it's too naive of an algorithm on its own. It also wouldn't handle matching "Mr. Smith" and "John Smith", or common misspellings. Some sort of context-driven approach would be best. Or maybe this is more of a job for Mechanical Turk :)
Thanks for the pointers to the other library, will take a look.