HMNI

Fuzzy name matching with machine learning. Perform common fuzzy name matching tasks including similarity scoring, record linkage, deduplication and normalization.

HMNI is trained on an internationally transliterated Latin first-name dataset, with priority given to precision over recall.

Model	Accuracy	Precision	Recall	F1-Score
HMNI-Latin	0.9393	0.9255	0.7548	0.8315
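As a quick consistency check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported figures; the variable names are illustrative only.

```python
# Reported HMNI-Latin metrics from the table above.
precision = 0.9255
recall = 0.7548

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.8315, matching the reported F1-Score
```

The gap between precision and recall reflects the stated design choice: the model prefers to miss some true matches rather than report false ones.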

For an introduction to the methodology and research behind HMNI, please refer to my blog post.

Requirements

Python 3.5–3.8

tensorflow
scikit-learn
fuzzywuzzy
abydos
unidecode

QUICK USAGE GUIDE

Installation

Using pip via PyPI

pip install hmni

Fix deprecated imports from numpy and collections

To resolve import errors from numpy and collections when importing hmni, restore the removed aliases before the import:

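import numpy
numpy.float = float  # alias removed in newer numpy releases
numpy.int = int      # alias removed in newer numpy releases
import collections
import collections.abc  # import the submodule explicitly before referencing it
collections.Iterable = collections.abc.Iterable
import hmni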
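If the same script has to run on both older and newer environments, the aliases can be applied defensively so that existing attributes are left alone. This is a minimal sketch; `patch_legacy_aliases` is a hypothetical helper name, not part of hmni.

```python
import collections
import collections.abc
import importlib.util


def patch_legacy_aliases():
    """Hypothetical helper: restore aliases that older code may still expect."""
    # Look in the module namespace directly; assigning an alias that
    # already resolves to the same object is harmless.
    if 'Iterable' not in vars(collections):
        collections.Iterable = collections.abc.Iterable

    # Patch numpy only when it is installed.
    if importlib.util.find_spec('numpy') is not None:
        import numpy
        for name, builtin in (('float', float), ('int', int)):
            if name not in vars(numpy):
                setattr(numpy, name, builtin)


patch_legacy_aliases()
# import hmni  # import hmni only after the aliases are in place
```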

Initialize a Matcher Object

import hmni
matcher = hmni.Matcher(model='latin')

Single Pair Similarity

matcher.similarity('Alan', 'Al')
# 0.6838303319889133

matcher.similarity('Alan', 'Al', prob=False)
# 1

matcher.similarity('Alan Turing', 'Al Turing', surname_first=False)
# 0.6838303319889133

Record Linkage

import pandas as pd

df1 = pd.DataFrame({'name': ['Al', 'Mark', 'James', 'Harold']})
df2 = pd.DataFrame({'name': ['Mark', 'Alan', 'James', 'Harold']})

merged = matcher.fuzzymerge(df1, df2, how='left', on='name')

Name Deduplication and Normalization

names_list = ['Alan', 'Al', 'Al', 'James']

matcher.dedupe(names_list, keep='longest')
# ['Alan', 'James']

matcher.dedupe(names_list, keep='frequent')
# ['Al', 'James']

matcher.dedupe(names_list, keep='longest', replace=True)
# ['Alan', 'Alan', 'Alan', 'James']

Matcher Parameters

hmni.Matcher(model='latin', prefilter=True, allow_alt_surname=True, allow_initials=True, allow_missing_components=True)

model (str) -- HMNI statistical model (latin by default)
prefilter (bool) -- Should the matcher prefilter unlikely candidates (True by default)
allow_alt_surname (bool) -- Should the matcher consider phonetically matching surnames, e.g. Smith, Schmidt (True by default)
allow_initials (bool) -- Should the matcher consider names with initials (True by default)
allow_missing_components (bool) -- Should the matcher consider names with missing components (True by default)

Matcher Methods

similarity(name_a, name_b, prob=True, threshold=0.5, surname_first=False)

name_a (str) -- First name for comparison
name_b (str) -- Second name for comparison
prob (bool) -- If True return a predicted probability, else binary class label
threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
surname_first (bool) -- If name strings start with surname (False by default)
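The relationship between prob and threshold can be sketched as a simple cutoff on the predicted probability. `to_label` is a hypothetical helper, and whether hmni treats a score exactly equal to the threshold as a match is an assumption here.

```python
def to_label(probability, threshold=0.5):
    """Hypothetical helper: turn a match probability into a 0/1 label.

    Assumes a score equal to the threshold counts as a match.
    """
    return int(probability >= threshold)


score = 0.6838303319889133  # probability shown for ('Alan', 'Al') above

print(to_label(score))                 # 1, consistent with prob=False above
print(to_label(score, threshold=0.7))  # 0, a stricter cutoff rejects the pair
```

Raising the threshold trades recall for precision, which may be preferable when false matches are costly.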

fuzzymerge(df1, df2, how='inner', on=None, left_on=None, right_on=None, indicator=False, limit=1, threshold=0.5, allow_exact_matches=True, surname_first=False)

df1 (pandas DataFrame or named Series) -- First/Left object to merge with
df2 (pandas DataFrame or named Series) -- Second/Right object to merge with
how (str) -- Type of merge to be performed
- inner (default): Use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
- left: Use only keys from left frame, similar to a SQL left outer join; preserve key order
- right: Use only keys from right frame, similar to a SQL right outer join; preserve key order
- outer: Use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
on (label or list) -- Column or index level names to join on. These must be found in both DataFrames
left_on (label or list) -- Column or index level names to join on in the left DataFrame
right_on (label or list) -- Column or index level names to join on in the right DataFrame
indicator (bool) -- If True, adds a column to output DataFrame called “_merge” with information on the source of each row (False by default)
limit (int) -- Top number of name matches to consider (1 by default)
threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
allow_exact_matches (bool) -- If True allow merging on exact name matches, else do not consider exact matches (True by default)
surname_first (bool) -- If name strings start with surname (False by default)
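To illustrate how how='left', limit and threshold interact, here is a plain-Python sketch of a fuzzy left join. It is not hmni's implementation: the scoring function is a stand-in based on `difflib.SequenceMatcher` from the standard library, and `stand_in_score` and `fuzzy_left_join` are hypothetical names.

```python
from difflib import SequenceMatcher


def stand_in_score(a, b):
    """Stand-in similarity in [0, 1]; hmni would use its trained model."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_left_join(left, right, threshold=0.5, limit=1, score=stand_in_score):
    """Pair each left name with up to `limit` right names scoring >= threshold."""
    rows = []
    for name in left:
        # Rank candidates from best to worst, then keep the top `limit`.
        ranked = sorted(right, key=lambda cand: score(name, cand), reverse=True)
        matches = [c for c in ranked[:limit] if score(name, c) >= threshold]
        if matches:
            rows.extend((name, match) for match in matches)
        else:
            # A left join keeps unmatched left rows, paired with None.
            rows.append((name, None))
    return rows


left = ['Al', 'Mark', 'James', 'Harold']
right = ['Mark', 'Alan', 'James', 'Harold']

pairs = fuzzy_left_join(left, right)
print(pairs)
# [('Al', 'Alan'), ('Mark', 'Mark'), ('James', 'James'), ('Harold', 'Harold')]
```

With an inner join the unmatched rows would be dropped instead of being kept with None.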

dedupe(names, threshold=0.5, keep='longest', reverse=True, limit=3, replace=False, surname_first=False)

names (list) -- List of names to dedupe
threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
keep (str) -- Specifies method for keeping one of multiple alternative names
- longest (default): Keeps longest name
- frequent: Keeps most frequent name in names list
reverse (bool) -- If True, sort matches in descending order, else ascending (True by default)
limit (int) -- Top number of name matches to consider (3 by default)
replace (bool) -- If True return normalized name list, else return deduplicated name list (False by default)
surname_first (bool) -- If name strings start with surname (False by default)
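The effect of keep and replace can be sketched without the trained model. The example below groups names with a stand-in similarity from `difflib` and then picks one representative per group; it is an illustration under that assumption, not hmni's algorithm, and `stand_in_match` and `dedupe_sketch` are hypothetical names.

```python
from collections import Counter
from difflib import SequenceMatcher


def stand_in_match(a, b, threshold=0.5):
    """Stand-in for the model: True when two names look similar enough."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def dedupe_sketch(names, keep='longest', replace=False, threshold=0.5):
    counts = Counter(names)
    groups = []
    # Greedy grouping: put each unique name in the first group it matches.
    for name in counts:
        for group in groups:
            if any(stand_in_match(name, member, threshold) for member in group):
                group.append(name)
                break
        else:
            groups.append([name])

    # Choose one representative per group according to `keep`.
    canonical = {}
    for group in groups:
        if keep == 'longest':
            winner = max(group, key=len)
        elif keep == 'frequent':
            winner = max(group, key=counts.__getitem__)
        else:
            raise ValueError("keep must be 'longest' or 'frequent'")
        for member in group:
            canonical[member] = winner

    normalized = [canonical[name] for name in names]
    # replace=True returns the normalized list; otherwise drop repeats.
    return normalized if replace else list(dict.fromkeys(normalized))


names_list = ['Alan', 'Al', 'Al', 'James']
print(dedupe_sketch(names_list, keep='longest'))                # ['Alan', 'James']
print(dedupe_sketch(names_list, keep='frequent'))               # ['Al', 'James']
print(dedupe_sketch(names_list, keep='longest', replace=True))  # ['Alan', 'Alan', 'Alan', 'James']
```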

assign_similarity(name_a, name_b, score)

name_a (str) -- First name for similarity score assignment
name_b (str) -- Second name for similarity score assignment
score (float) -- Assigned similarity score for pair of names

Contributing

Pull requests are welcome. For developers who want to build a model for Latin or non-Latin writing systems (Chinese, Cyrillic, Arabic), Jupyter notebooks in the dev folder show how to build models using similar methods.

License

MIT
