probablepeople is a Python library for parsing unstructured romanized name or company strings into components, using conditional random fields, a probabilistic sequence-labeling method. It is based on usaddress, a Python library for parsing addresses.
What this can do: Using a probabilistic model, it makes (very educated) guesses in identifying name or corporation components, even in tricky cases where rule-based parsers typically break down.
What this cannot do: It cannot identify components with perfect accuracy, nor can it verify that a given name/company is correct/valid.
probablepeople learns how to parse names/companies through a body of training data. If you have examples of names/companies that stump this parser, please send them over! By adding more examples to the training data, probablepeople can continue to learn and improve.
Install probablepeople
pip install probablepeople
Parse some names/companies!
>>> import probablepeople
>>> probablepeople.parse('Mr George "Gob" Bluth II')
[('Mr', 'PrefixMarital'), ('George', 'GivenName'), ('"Gob"', 'Nickname'), ('Bluth', 'Surname'), ('II', 'SuffixGenerational')]
>>> probablepeople.parse('Sitwell Housing Inc')
[('Sitwell', 'CorporationName'), ('Housing', 'CorporationName'), ('Inc', 'CorporationLegalType')]
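Because `parse` returns a flat list of (token, label) pairs, tokens that share a label (such as the two `CorporationName` tokens above) often need to be joined before use. The following is a minimal sketch that assumes only the output shape shown above; `group_by_label` is a hypothetical helper, not part of probablepeople, and it joins every token with a given label even when they are not adjacent.

```python
from collections import OrderedDict


def group_by_label(parsed):
    """Join tokens that share a label, keeping first-seen label order.

    `parsed` is a list of (token, label) pairs shaped like the output of
    probablepeople.parse(). This helper is illustrative, not library API.
    """
    grouped = OrderedDict()
    for token, label in parsed:
        grouped.setdefault(label, []).append(token)
    return OrderedDict(
        (label, " ".join(tokens)) for label, tokens in grouped.items()
    )


# Output copied from the company example above, so no install is needed here.
parsed = [
    ("Sitwell", "CorporationName"),
    ("Housing", "CorporationName"),
    ("Inc", "CorporationLegalType"),
]
company = group_by_label(parsed)
print(company["CorporationName"])       # Sitwell Housing
print(company["CorporationLegalType"])  # Inc
```

In real use, `parsed` would come from `probablepeople.parse(...)` instead of a literal list.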
- Documentation: http://probablepeople.rtfd.org/
- Web Interface: http://parserator.datamade.us/probablepeople
- Distribution: https://pypi.python.org/pypi/probablepeople
- Repository: https://github.com/datamade/probablepeople
- Issues: https://github.com/datamade/probablepeople/issues
- Blog post: http://datamade.us/blog/parse-name-or-parse-anything-really/
Probablepeople uses parserator, a library for making and improving probabilistic parsers - specifically, parsers that use python-crfsuite's implementation of conditional random fields. Parserator allows you to train probablepeople's model (a .crfsuite settings file) on labeled training data, and provides tools for easily adding new labeled training data.
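To give a feel for how a conditional random field sees a string, the sketch below turns each token into a dictionary of features plus some context from its neighbors, which is the general kind of input a CRF tagger is trained on. The specific features and the function names `token_features` and `sequence_features` are assumptions made for illustration; they are not the features probablepeople or parserator actually use.

```python
def token_features(token):
    """Describe one token with simple, hand-picked features (illustrative only)."""
    stripped = token.strip('".,')
    return {
        "lower": stripped.lower(),
        "is_capitalized": stripped[:1].isupper(),
        "is_all_caps": stripped.isupper(),
        "has_period": "." in token,
        "is_quoted": token.startswith('"') and token.endswith('"'),
        "length": len(stripped),
    }


def sequence_features(raw_string):
    """Build one feature dict per whitespace-separated token, with neighbor context."""
    tokens = raw_string.split()
    features = [token_features(t) for t in tokens]
    last = len(tokens) - 1
    for i, feats in enumerate(features):
        # Neighboring tokens help a sequence model decide between labels.
        feats["prev_lower"] = features[i - 1]["lower"] if i > 0 else "<start>"
        feats["next_lower"] = features[i + 1]["lower"] if i < last else "<end>"
    return features


feats = sequence_features('Mr George "Gob" Bluth II')
print(feats[2]["is_quoted"])   # True: a hint that the token may be a nickname
print(feats[4]["is_all_caps"])  # True: a hint that the token may be a suffix
```

A trained model weighs features like these, together with the labels of adjacent tokens, to choose the most probable label sequence for the whole string.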
git clone https://github.com/datamade/probablepeople.git
cd probablepeople
pip install -r requirements.txt
python setup.py develop
parserator train name_data/labeled/labeled.xml,name_data/labeled/company_labeled.xml probablepeople
nosetests .
If the parser handles certain name/company formats poorly, you can add examples of them to the training data. Each new case the model is trained on makes the parser more robust.
parserator label [infile] [outfile] probablepeople
For example, our labeled names are in name_data/labeled/labeled.xml, so you can use:
parserator label [infile] name_data/labeled/labeled.xml probablepeople
This will start a console labeling task, where you will be prompted to label raw strings via the command line. For more info on using parserator, see the parserator documentation.
If you've added new training data, you will need to re-train the model. To set multiple files as traindata, separate them with commas.
parserator train [traindata] probablepeople
For example, to train the model on both labeled names and labeled companies:
parserator train name_data/labeled/labeled.xml,name_data/labeled/company_labeled.xml probablepeople
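If you retrain from a script, the comma-separated file list can be assembled programmatically. This is a minimal sketch that only builds the command shown above; `build_train_command` is a hypothetical helper, and the sketch assumes `parserator` is on your PATH when the command is eventually run.

```python
import shlex


def build_train_command(training_files, module="probablepeople"):
    """Return the argument list for `parserator train`.

    Multiple training files are joined with commas and no spaces,
    so they reach parserator as a single argument.
    """
    if not training_files:
        raise ValueError("at least one training file is required")
    return ["parserator", "train", ",".join(training_files), module]


command = build_train_command([
    "name_data/labeled/labeled.xml",
    "name_data/labeled/company_labeled.xml",
])
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in command))
# To actually run it: subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if a training file path contains spaces.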
Contribute back by sending a pull request with your added labeled examples.
Copyright (c) 2014 Atlanta Journal Constitution. Released under the MIT License.