A Python library for parsing unstructured Western names into name components.

probablepeople

probablepeople is a Python library for parsing unstructured romanized name or company strings into components, using natural language processing (NLP) methods. It is based on usaddress, a Python library for parsing addresses.

What this can do: Using a probabilistic model, it makes (very educated) guesses in identifying name or corporation components, even in tricky cases where rule-based parsers typically break down.

What this cannot do: It cannot identify components with perfect accuracy, nor can it verify that a given name/company is correct/valid.

probablepeople learns how to parse names/companies through a body of training data. If you have examples of names/companies that stump this parser, please send them over! By adding more examples to the training data, probablepeople can continue to learn and improve.

How to use probablepeople

  1. Install probablepeople

    pip install probablepeople  
    
  2. Parse some names/companies!

    >>> import probablepeople  
    >>> probablepeople.parse('Mr George "Gob" Bluth II')  
    [('Mr', 'PrefixMarital'), ('George', 'GivenName'), ('"Gob"', 'Nickname'), ('Bluth', 'Surname'), ('II', 'SuffixGenerational')]  
    >>> probablepeople.parse('Sitwell Housing Inc')  
    [('Sitwell', 'CorporationName'), ('Housing', 'CorporationName'), ('Inc', 'CorporationLegalType')]  
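
Since `parse` returns a list of (token, label) pairs, it is often handy to merge tokens that share a label into a single value. The following is a minimal sketch of one way to do that; `group_by_label` is a hypothetical helper, not part of probablepeople, and the sample data is copied from the example output above rather than produced by a live call.

```python
def group_by_label(tagged_tokens):
    """Join all tokens that share a label, keeping first-seen label order.

    `tagged_tokens` is a list of (token, label) pairs in the shape that
    the README shows for probablepeople.parse().
    """
    grouped = {}
    for token, label in tagged_tokens:
        grouped.setdefault(label, []).append(token)
    return {label: " ".join(tokens) for label, tokens in grouped.items()}


# Sample output copied from the README example (assumption: not a live call).
sample = [
    ("Sitwell", "CorporationName"),
    ("Housing", "CorporationName"),
    ("Inc", "CorporationLegalType"),
]

components = group_by_label(sample)
print(components)
# {'CorporationName': 'Sitwell Housing', 'CorporationLegalType': 'Inc'}
```

Note that this sketch also merges tokens whose shared label is not contiguous, which may or may not be what you want for a given input.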

For the nerds:

Probablepeople uses parserator, a library for making and improving probabilistic parsers - specifically, parsers that use python-crfsuite's implementation of conditional random fields. Parserator allows you to train probablepeople's model (a .crfsuite settings file) on labeled training data, and provides tools for easily adding new labeled training data.
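
To give a feel for what a conditional random field works with, here is a rough sketch of per-token feature extraction. A CRF labels each token using features of that token and of its neighbors. The feature names below are hypothetical and chosen for illustration only; they are not probablepeople's actual feature set, and the code does not call python-crfsuite.

```python
def token_features(token):
    """Hypothetical features for one token (illustration only)."""
    stripped = token.strip('".,()')
    return {
        "lower": stripped.lower(),
        "is_title_case": stripped.istitle(),
        "is_all_caps": stripped.isupper(),
        "has_period": "." in token,
        "is_quoted": len(token) >= 2 and token.startswith('"') and token.endswith('"'),
        "length": len(stripped),
    }


def sequence_features(tokens):
    """Add position and previous-token context to each token's features."""
    features = [token_features(t) for t in tokens]
    for i, feats in enumerate(features):
        feats["is_first"] = i == 0
        feats["is_last"] = i == len(features) - 1
        if i > 0:
            feats["prev_lower"] = features[i - 1]["lower"]
    return features


features = sequence_features('Mr George "Gob" Bluth II'.split())
for feats in features:
    print(feats)
```

In a sketch like this, a quoted token in the middle of a name is a hint for a nickname label, and an all-caps final token is a hint for a generational suffix; the trained model weighs many such hints together instead of relying on any single rule.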

Building & testing development code

git clone https://github.com/datamade/probablepeople.git  
cd probablepeople  
pip install -r requirements.txt  
python setup.py develop
parserator train name_data/labeled/labeled.xml,name_data/labeled/company_labeled.xml probablepeople
nosetests .  

Creating/adding labeled training data (.xml outfile) from unlabeled raw data (.csv infile)

If there are name/company formats that the parser isn't performing well on, you can add them to the training data. Each newly labeled case gives probablepeople more to learn from, so the parser becomes more robust over time.

parserator label [infile] [outfile] probablepeople  

For example, our labeled names are stored in name_data/labeled/labeled.xml, so to add to them you can run:

parserator label [infile] name_data/labeled/labeled.xml probablepeople  

This will start a console labeling task, where you will be prompted to label raw strings via the command line. For more info on using parserator, see the parserator documentation.
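
Before starting a labeling session, it can help to clean up the raw strings that go into the infile. The sketch below writes unique, whitespace-normalized strings to a CSV. It assumes the infile holds one raw string per row in the first column; check the parserator documentation for the exact format it expects. `write_unlabeled_csv` and the file name `unlabeled_names.csv` are hypothetical.

```python
import csv
import os
import tempfile


def write_unlabeled_csv(raw_strings, path):
    """Write unique, non-empty strings to `path`, one per row.

    Assumption: the labeling tool reads one raw string per CSV row.
    Returns the number of rows written.
    """
    seen = set()
    rows = []
    for raw in raw_strings:
        cleaned = " ".join(raw.split())  # collapse runs of whitespace
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            rows.append([cleaned])
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return len(rows)


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    infile = os.path.join(tmp_dir, "unlabeled_names.csv")  # hypothetical name
    count = write_unlabeled_csv(
        ['Mr George "Gob" Bluth II', "Sitwell  Housing Inc", "Sitwell Housing Inc", ""],
        infile,
    )
    print(count, "strings written")
```

The resulting file would then be passed as `[infile]` to the `parserator label` command shown above.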

Re-training the model

If you've added new training data, you will need to re-train the model. To set multiple files as traindata, separate them with commas.

parserator train [traindata] probablepeople  

For example, to train the model on both labeled names and labeled companies:

parserator train name_data/labeled/labeled.xml,name_data/labeled/company_labeled.xml probablepeople  
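
If you script re-training, the comma-separated traindata argument can be assembled from a list of files. This is a small sketch that only builds the argument list for the command shown above; `build_train_command` is a hypothetical helper, and the sketch does not run parserator itself.

```python
def build_train_command(training_files, module_name="probablepeople"):
    """Build the argv for `parserator train [traindata] [module]`.

    Multiple training files are joined with commas, as the README describes.
    """
    if not training_files:
        raise ValueError("at least one training file is required")
    for path in training_files:
        if "," in path:
            # A comma inside a path would be read as a file separator.
            raise ValueError("training file paths must not contain commas: %r" % path)
    return ["parserator", "train", ",".join(training_files), module_name]


command = build_train_command([
    "name_data/labeled/labeled.xml",
    "name_data/labeled/company_labeled.xml",
])
print(" ".join(command))
```

The returned list could be handed to `subprocess.run(command, check=True)` from the repository root once parserator is installed.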

Contribute back by sending a pull request with your added labeled examples.

Copyright

Copyright (c) 2014 Atlanta Journal Constitution. Released under the MIT License.