/groupbyrule

Deduplicate data using fuzzy and deterministic matching rules.

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

test

🔗 GroupByRule: deduplicate data using fuzzy and deterministic matching rules

🚧 under construction 🚧

GroupByRule is a Python package for data cleaning and deduplication. It integrates with pandas' groupby function to not only group dataframe rows by a given identifier, but also groups rows based on logical rules and partial matching. In other words, it provides tools for deterministic record linkage and entity resolution in structured databases. It can also be used for blocking, a form of filtering used to speed-up more complex entity resolution algorithms. See the references below to learn more about these topics.

One of the main goal of GroupByRule is to be user-friendly. Matching rules and clustering algorithms are composable and the performance of algorithms can be readily evaluated given training data. The package is built on top of pandas for data manipulation and on igraph for graph clustering and related computations.

Additionally, GroupByRule provides highly efficient C++ implementations of common string distance functions through its comparator submodule. This can be used independently of record linkage algorithms.

Installation

Install from github using the following command:

 pip install git+https://github.com/OlivierBinette/groupbyrule.git

Examples

Rule-Based Linkage

Consider the RLdata500 dataset from the RecordLinkage R package.

from groupbyrule.data import load_RLdata500

df = load_RLdata500()
df.head()
fname_c1 fname_c2 lname_c1 lname_c2 by bm bd identity
1 CARSTEN NaN MEIER NaN 1949 7 22 34
2 GERD NaN BAUER NaN 1968 7 27 51
3 ROBERT NaN HARTMANN NaN 1930 4 30 115
4 STEFAN NaN WOLFF NaN 1957 9 2 189
5 RALF NaN KRUEGER NaN 1966 1 13 72

We deduplicate this dataset by linking records which match either on both first name (fname_c1) and last name (lname_c1), on both first name and birth day (bd), or on both last name and birth day. Linkage transitivity is resolved, by default, by considering connected components of the resulting graph. Precision and recall are computed from the ground truth membership vector identity_RLdata500.

from groupbyrule import Any, Match, precision_recall

# Specify linkage rule
rule = Any(Match("fname_c1", "lname_c1"),
           Match("fname_c1", "bd"),
           Match("lname_c1", "bd"))

# Apply the rule to a dataset
rule.fit(df)

# Evaluate performance by computing precision and recall
precision_recall(rule.groups, df.identity)
(0.11538461538461539, 0.96)

This is not the best way to deduplicate this dataset, but the above showcases the composability of matching rules. The specific rules themselves (exact matching, similarity-based string matching, and different clustering algorithms) can be customized as needed. A more complete overview is available here 🚧.

A better way to deduplicate this data is to link all pairs of records which agree on all but at most one attribute. This is done below.

from groupbyrule import AllButK

# Link records agreeing on all but at most k=1 of the specified attributes
rule = AllButK("fname_c1", "lname_c1", "bd", "bm", "by", k=1)

# Apply the rule to a dataset
rule.fit(df)

# Evaluate performance by computing precision and recall
precision_recall(rule.groups, df.identity)
(1.0, 0.92)

Postprocessing

Following record linkage, records can be processed using pandas's groupby and aggregation functions. Below, we only keep the first non-NA attribute value for each record cluster. This is a simple way to obtain a deduplicated dataset.

deduplicated = df.groupby(rule.groups).first()

String Distance Functions

GroupByRule provides a suite of string and numerical similarity functions as part of its comparator submodule. String similarity functions include the Levenshtein distance, the Jaro-Winkler distance, and 🚧. These similarity functions can be used on their own as shown below, or for the definition of linkage rules as explained in the following section. This is heavily inspired by Neil Marchant's excellent Comparator R package, but not quite equivalent in its scope and implementation.

String distance functions are implemented through subclasses of the Comparator abstract base case. Comparator objects are used to instanciate comparison functions while allowing data in memory to be recycled across function calls. The compare() method can then be used to compare elements, the pairwise() method can be used to compare all pairs of elements between two lists, and the elementwise() method can be used to compare corresponding elements.

Below are examples of the comparison functions currently provided. These are implemented in C++ for efficiency.

Levenshtein Distance

from groupbyrule.comparator import Levenshtein

cmp = Levenshtein(normalize=True, similarity=False)
cmp.compare("Olivier", "Oilvier")
0.25

Longest Common Subsequence (LCS) Distance

from groupbyrule.comparator import LCSDistance

cmp = LCSDistance(normalize=True, similarity=False)
cmp.compare("Olivier", "Oilvier")
0.25

Damerau-Levenshtein Distance

from groupbyrule.comparator import DamerauLevenshtein

cmp = DamerauLevenshtein(normalize=True, similarity=False)
cmp.compare("Olivier", "Oilvier")
0.13333333333333333

Jaro Distance

from groupbyrule.comparator import Jaro

cmp = Jaro(similarity=False)
cmp.compare("Olivier", "Oilvier")
0.04761904761904756

Jaro-Winkler Distance

from groupbyrule.comparator import JaroWinkler

cmp = JaroWinkler(similarity=False)
cmp.compare("Olivier", "Oilvier")
0.042857142857142816

Similarity-Based Linkage Rules

🚧

Supervised Approaches and Learning Rules

🚧

Clustering Algorithms

🚧

Performance Evaluation

🚧

Privacy-Preserving Record Linkage

🚧

Cryptographic Primitives

🚧

Multiparty Computation Protocols

🚧

References

🚧