A framework for doing identity resolution at scale

Identity resolution enables organizations to link entities across a diversity of data sources, where entities can be

Customers
Products
Employees
Suppliers
Etc.

This becomes challenging when an organization cannot, for one reason or another, use a consistent set of unique keys to identify their entities of interest. In this situation, we can resort to approximate matching, using a mix of human- and machine-powered to form the linkages. In doing so, the organization can answer questions such as:

"Have I seen this entity before?"
"What activity is associated with this entity?"

The framework here uses a set of open-source packages to enable search, approximate matching and linkage. It is intended to be run on Databricks Unified Analytics Platform using Delta Lake as an open-source storage layer, but could be adapted to run on any platform running Apache Spark.

tlortz/identity-resolution-spark

A framework for doing identity resolution at scale