/identity-resolution-spark

A framework for matching entities across datasets that lack explicit key-based linkage, at scale, using Apache Spark and other tools.

Primary language: Python

A framework for doing identity resolution at scale

Identity resolution enables organizations to link entities across diverse data sources, where entities can be:

  • Customers
  • Products
  • Employees
  • Suppliers
  • Etc.

This becomes challenging when an organization cannot, for one reason or another, use a consistent set of unique keys to identify its entities of interest. In this situation, we can resort to approximate matching, using a mix of human- and machine-powered techniques to form the linkages. In doing so, the organization can answer questions such as:

  • "Have I seen this entity before?"
  • "What activity is associated with this entity?"

The framework here uses a set of open-source packages to enable search, approximate matching, and linkage. It is intended to run on the Databricks Unified Analytics Platform, using Delta Lake as an open-source storage layer, but could be adapted to any platform running Apache Spark.
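To make the idea of approximate matching concrete, here is a minimal sketch in plain Python. It is not the framework's own code: the function names (`normalize`, `similarity`, `link_records`), the sample records, and the 0.85 threshold are all hypothetical, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever matching packages the framework actually relies on. It also compares every pair of records, which would not scale; a Spark-based pipeline would need some form of candidate search or blocking first.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't affect the score."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left, right, threshold=0.85):
    """Pair each left record with its best-scoring right record, if the score clears the threshold.

    `left` and `right` are lists of (record_id, name) tuples; the threshold is an
    illustrative assumption, not a recommended value.
    """
    links = []
    for left_id, left_name in left:
        best_id, best_score = None, 0.0
        for right_id, right_name in right:
            score = similarity(left_name, right_name)
            if score > best_score:
                best_id, best_score = right_id, score
        if best_id is not None and best_score >= threshold:
            links.append((left_id, best_id, round(best_score, 3)))
    return links


# Hypothetical records from two sources that share no common key.
crm = [("c1", "Acme Corporation"), ("c2", "Globex Inc")]
billing = [("b1", "ACME  Corporation"), ("b2", "Initech LLC")]

links = link_records(crm, billing)
print(links)
```

In this toy data, only the two Acme records are linked, because they become identical after normalization; "Globex Inc" has no counterpart above the threshold. That is the "Have I seen this entity before?" question in miniature. In practice, borderline scores are the cases where human review would typically come in.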