/CleanCorp

Corpus based business entity reconciliation.

Primary LanguagePythonMIT LicenseMIT

CleanCorp - clean business names

Corpus based business entity reconciliation.

Description

This Python package processes company names, provides a clean version of the business name by stripping away terms indicating organization type (such as "Ltd." or "Corp") and attempts to extract the root name of the company by removing industry markers (such as "Metals", "Solar", "Banking", etc ...).

Using an organization type to suffix database, it also provides an utility to deduce the type of organization, in terms of US/UK business entity types (ie. "limited liability company" or "non-profit").

Furthermore, thanks to a multilingual annotated corpus, terms such as "Analytics" or "Accounting" are matched to their respective industries ('Technology', 'Financial Services').

Finally, the system uses the suffix' information to suggest countries the organization could be established in. For example, the term "Oy" in a company's name it suggests it is established in Finland, whereas "Ltd" in company name could mean UK, US or a number of other countries.

Disclamer

This package is not a replacement for a name reconcialiation service such as OpenCorporate's but rather a helper library to gain insights on un-reconciliable data, like the Panama Papers or other offshore leaks.

Milage may vary depending on typos and formating. You still have to clean your data... sorry.

Install

pip3 install git+https://github.com/Syker.uk/cleancorp.git

Code

>>> from cleancorp import CleanCorp

Initialize the instance with your business string:

>>> business_name = "Some Big Pharma, LLC"
>>> x = CleanCorp(business_name)
>>> print(x)
CleanCorp([Some Big Pharma, LLC])

You can now access CleanCorp's properties and attributes:

 >>> x.is_company()
True

>>> x.entity_type
['Limited Liability Company']

>>> x.industry
['Health Care']

>>> x.country
['United States of America', 'Philippines']

>>> x.clean_name
'Some Big Pharma'

>>> x.root_name
'Some Big'

 >>> x.as_dict()
{
    'is_company' : True,
    'original_name' : 'Some Big Pharma, LLC', 
    'clean_name' : 'Some Big Pharma', 
    'root_name' : 'Some Big', 
    'entity_type' : ['Limited Liability Company'], 
    'industry' : ['Health Care'], 
    'country' : ['United States of America', 'Philippines']
}

Test

Run the test.ipynb notebook.

The Data:

Suffix to Country to Entity Type data was compiled with OpenRefine from multiple sources, notably:

Test data was provided by:

The corpus of industry terms was compiled from kaggle's 7+ Million Company Dataset.

Credits