tupilabs/HumanNameParser.java

Support custom postnominals

JamoCA opened this issue · 4 comments

Could support be added to define custom postnominals?

I'm parsing some real-world data and encountering some medical suffixes for which I'm having to write extra rules outside of this library.

Here are the ones I've identified so far in a small sample of about 100 records provided by a client.

  • Au.D.
  • M.S.
  • M.A.
  • M.Ed.

Thanks.

kinow commented

Hi @JamoCA

That's a good idea I think.

I will have to take a look at the library code to refresh my memory, but maybe we could have separate sets of suffixes for things like IT suffixes, medical suffixes, engineering, etc. What do you think? Do you have any suggestion on how that should be implemented?

Do you know of a list for these suffixes? Here's a website of a local clinic that contains a list of its medical staff: https://www.connectmed.co.nz/practice/pitt-street-medical. Are the suffixes in your data similar to values like "MB ChB, Dip Obst, DCH, M, F RNZCGP"?

JamoCA commented

I'm not entirely sure how to best implement it, but adding support for some of the more popular ones which can't be confused with last names would be beneficial. I'm not very adept with Java. (My primary development language is ColdFusion which uses Java to perform JIT compilation of CFML to class files.)

A lot of the industry-specific data that I was provided with was inconsistent and had different capitalization, spacing & period usage. I was initially just title casing the "full name" value that also contained title & suffixes, but then realized that I needed to parse it and remembered that this library was available. I added HNP to the import process, but it was incorrectly parsing some of the suffixes as last names. I wrote a CFML-based wrapper component to:

  • title case entire string (HNP doesn't modify case)
  • identify known suffixes
  • strip identified suffixes
  • parse using HumanNameParser
  • reintegrate stripped suffixes w/proper case

I also wanted to ensure standardization among the suffixes: while "MD" was identified as a suffix, "M.D." was parsed as the last name. (Standard abbreviations usually have periods. The APA Publication Manual recommends not using periods with degrees while other reference manuals recommend using periods.) For suffix identification via regex, I've been using something like (?i)\bM\.*D\.*\b.

In mixed-case usage, I'm reformatting suffixes like "AUD, AU.D, AU.D." and "PH.D, PHD, PH.D." as "Au.D." and "Ph.D.". I use a regex to look for variations using optional periods and word boundaries and then replace each match with the proper-case version. (Uppercasing afterwards is easy; proper-casing is not.)
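As a rough illustration of the normalization described above, here is a minimal Java sketch. It is not part of HumanNameParser, and the class and method names (SuffixNormalizer, normalize) are hypothetical. It compares the \b-based pattern quoted above with an alternative that uses lookarounds. The alternative is my assumption, not something proposed in the thread: a trailing \b cannot match after a period, so the \b version stops at "M.D" and leaves the final period behind.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuffixNormalizer {
    // Pattern quoted in the comment above. The trailing \b only matches
    // between a word character and a non-word character, so for "M.D."
    // the match ends at "M.D" and the last period is left in the input.
    static final Pattern WITH_WORD_BOUNDARY =
            Pattern.compile("(?i)\\bM\\.*D\\.*\\b");

    // Assumed alternative: require that the match is not preceded by a
    // word character or period and not followed by a word character.
    static final Pattern WITH_LOOKAROUND =
            Pattern.compile("(?i)(?<![\\w.])M\\.?D\\.?(?!\\w)");

    // Replace every match of the pattern with the canonical spelling.
    static String normalize(Pattern pattern, String input, String canonical) {
        return pattern.matcher(input).replaceAll(Matcher.quoteReplacement(canonical));
    }

    public static void main(String[] args) {
        // prints "Jane Doe M.D.." (stray period)
        System.out.println(normalize(WITH_WORD_BOUNDARY, "Jane Doe M.D.", "M.D."));
        // prints "Jane Doe M.D."
        System.out.println(normalize(WITH_LOOKAROUND, "Jane Doe M.D.", "M.D."));
        // prints "Jane Doe M.D., FAAA"
        System.out.println(normalize(WITH_LOOKAROUND, "Jane Doe md, FAAA", "M.D."));
    }
}
```

With the lookaround form, a single pattern per suffix covers the "MD", "M.D" and "M.D." variants, so no extra rules are needed to avoid a leftover period.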

I found some other rules at http://guidetogrammar.org/grammar/abbreviations.htm, which state that all sources advise against using titles before and after a name at the same time. An example of this is using "Dr." at the beginning while ending with "MD", but some of the real-world data I've encountered uses it in both places. (I guess these doctors didn't get the memo.) I'm not entirely sure what to do in these cases. Sometimes "MD" is an abbreviation for "Managing Director", so maybe having suffixes segregated by industry type would be beneficial, with an ability to "use all but prioritize".
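One way to picture "use all but prioritize" is the sketch below. It is purely illustrative: SuffixRegistry, Industry and resolve are hypothetical names, not the library's API, and the suffix sets contain only the abbreviations mentioned in this thread. The lookup checks the preferred industry first and falls back to the other sets.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class SuffixRegistry {
    // Hypothetical industry groupings, for illustration only.
    enum Industry { MEDICAL, BUSINESS }

    // Suffix sets keyed by industry; keys are upper-case with periods removed.
    static final Map<Industry, Map<String, String>> SETS = new EnumMap<>(Industry.class);
    static {
        SETS.put(Industry.MEDICAL, Map.of(
                "MD", "Doctor of Medicine",
                "FAAA", "Fellow of the American Academy of Audiology"));
        SETS.put(Industry.BUSINESS, Map.of(
                "MD", "Managing Director"));
    }

    // Look in the preferred industry first, then in every other set.
    static Optional<String> resolve(String suffix, Industry preferred) {
        String key = suffix.replace(".", "").toUpperCase(Locale.ROOT);
        Map<String, String> first = SETS.get(preferred);
        if (first != null && first.containsKey(key)) {
            return Optional.of(first.get(key));
        }
        for (Map<String, String> set : SETS.values()) {
            if (set.containsKey(key)) {
                return Optional.of(set.get(key));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(resolve("M.D.", Industry.BUSINESS).orElse("unknown"));
        System.out.println(resolve("FAAA", Industry.BUSINESS).orElse("unknown"));
    }
}
```

Here "M.D." resolves to "Managing Director" when BUSINESS is preferred, while "FAAA" still resolves through the medical set because no business entry exists for it.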

"FAAA" was one of the suffixes that I came across. I didn't know what it meant and went to https://www.acronymfinder.com/ to find out. Apparently it's "Fellow of the American Academy of Audiology". There are many acronyms not related to names. The site doesn't appear to have a category solely dedicated to names.

I also encountered a nuanced sort order regarding how some postnominals are displayed in the industry "medical; audiology" (i.e., M.A. is apparently more important than Au.D.). As a result, I'm using an array to extract suffixes according to a semantic hierarchy and then output them in the order detected, rather than the order in the initially provided string, so that the results are consistent across all name strings.

Here's a short example of the rules I'm using so far: (NOTE: I had to write a couple different rules for Au.D. so that it didn't match too early and leave an extra period.)

rules = [ 
	["Ph.D.", ["(?i)\bPh\.*D\.*\b"]],
	["M.A.", ["(?i)\bM\.*A\.*\b"]],
	["Au.D.", ["(?i)\bAU\.D\.", "(?i)\bAUD\b", "(?i)\bAU\.D\b"]],
	["M.S.", ["(?i)\bM\.*S\.*\b"]],
	["M.D.", ["(?i)\bM\.*D\.*\b"]],
	["CCC-A", ["(?i)\bCCC\-A"]],
	["MSCCCA", ["(?i)\bMSCCCA\b"]],
	["FAAA", ["(?i)\bFAAA\b"]]
];
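
For readers working in Java, the rule array above could be sketched roughly as follows. This is a minimal sketch under assumptions, not the CFML wrapper itself and not HumanNameParser code: SuffixExtractor, Extraction and extract are hypothetical names, the lookaround-based patterns are my substitution for the \b rules, and the certification suffix is written as "CCC-A" to agree with its regex. Rule order doubles as the output order, so suffixes come back in hierarchy order regardless of how the input listed them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuffixExtractor {

    /** Name with suffixes removed, plus canonical suffixes in rule order. */
    static final class Extraction {
        final String name;
        final List<String> suffixes;

        Extraction(String name, List<String> suffixes) {
            this.name = name;
            this.suffixes = suffixes;
        }
    }

    // Insertion order is the display hierarchy (M.A. before Au.D.).
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();

    // Wrap each rule body so it only matches as a standalone token.
    private static void rule(String canonical, String body) {
        RULES.put(canonical, Pattern.compile("(?i)(?<![\\w.])" + body + "(?!\\w)"));
    }

    static {
        rule("Ph.D.", "Ph\\.?D\\.?");
        rule("M.A.", "M\\.?A\\.?"); // caution: also matches a surname such as "Ma"
        rule("Au.D.", "Au\\.?D\\.?");
        rule("M.S.", "M\\.?S\\.?");
        rule("M.D.", "M\\.?D\\.?");
        rule("CCC-A", "CCC-A");
        rule("MSCCCA", "MSCCCA");
        rule("FAAA", "FAAA");
    }

    static Extraction extract(String input) {
        String rest = input;
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : RULES.entrySet()) {
            Matcher matcher = entry.getValue().matcher(rest);
            if (matcher.find()) {
                found.add(entry.getKey());
                rest = matcher.replaceAll(" "); // strip every occurrence
            }
        }
        // Collapse runs of whitespace, then drop trailing commas and spaces.
        rest = rest.replaceAll("\\s{2,}", " ").replaceAll("[\\s,]+$", "");
        return new Extraction(rest, found);
    }

    public static void main(String[] args) {
        Extraction result = extract("Jane Doe, AU.D., m.a.");
        // The cleaned name is what would be handed to the name parser.
        // prints "Jane Doe | M.A., Au.D."
        System.out.println(result.name + " | " + String.join(", ", result.suffixes));
    }
}
```

The short patterns remain risky on real data (see the "Ma" comment), so restricting matching to the text after the first comma, or to the final tokens of the string, may be a safer variation.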

kinow commented

I've cut a 0.2 release with the new builder. @JamoCA, feel free to re-open this or open a new issue in case it doesn't completely solve your use case. Thanks!