senzing-databricks

Senzing integration in Databricks Spark workflows.

Tutorial components

This repository contains two Jupyter notebooks that demonstrate Senzing entity resolution integrated with Apache Spark in Databricks workflows:

spark_quickstart.ipynb: An introductory tutorial showing how to integrate Senzing with Spark DataFrames for batch entity resolution. It covers loading multiple datasets, configuring Senzing, processing records, and enriching DataFrames with entity resolution results.
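
The batch pattern looks roughly like the sketch below. It assumes the senzing-grpc Python package's SzAbstractFactoryGrpc, create_engine, and add_record calls, illustrative column and field names, and the TEST data source from Senzing's default configuration; the notebook itself is the authoritative version.

import json
import grpc
from pyspark.sql import SparkSession
from senzing_grpc import SzAbstractFactoryGrpc

# Tiny DataFrame standing in for the tutorial datasets (columns are illustrative).
spark = SparkSession.builder.appName("senzing-batch-sketch").getOrCreate()
df = spark.createDataFrame(
    [("1001", "Robert Smith", "bsmith@example.com")],
    ["record_id", "name_full", "email_address"],
)

# Connect to the Senzing gRPC server started by the docker run command in Requirements.
sz_factory = SzAbstractFactoryGrpc(grpc_channel=grpc.insecure_channel("localhost:8261"))
sz_engine = sz_factory.create_engine()

# Map each row to a Senzing JSON record and load it. collect() is fine for a small
# tutorial dataset; a production job would distribute this work across partitions.
for row in df.collect():
    record = {"NAME_FULL": row["name_full"], "EMAIL_ADDRESS": row["email_address"]}
    sz_engine.add_record("TEST", row["record_id"], json.dumps(record))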

spark_streaming.ipynb: Demonstrates entity resolution on streaming data using Spark Structured Streaming. It shows how to process a simulated real-time feed through Spark and send PII features to Senzing for continuous entity resolution.
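
A minimal sketch of the streaming pattern, using Spark's built-in rate source as a stand-in for the simulated feed and the same assumed senzing-grpc calls as above:

import json
import grpc
from pyspark.sql import SparkSession
from senzing_grpc import SzAbstractFactoryGrpc

spark = SparkSession.builder.appName("senzing-streaming-sketch").getOrCreate()

def send_to_senzing(batch_df, batch_id):
    # foreachBatch calls this on the driver for every micro-batch; the field names and
    # one-connection-per-batch approach are illustrative only.
    sz_factory = SzAbstractFactoryGrpc(grpc_channel=grpc.insecure_channel("localhost:8261"))
    sz_engine = sz_factory.create_engine()
    for row in batch_df.collect():
        record = {"NAME_FULL": f"Person {row['value']}"}
        sz_engine.add_record("TEST", str(row["value"]), json.dumps(record))

# The rate source emits (timestamp, value) rows, standing in for a real PII feed.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
query = stream.writeStream.foreachBatch(send_to_senzing).start()
query.awaitTermination(30)  # let the sketch run briefly
query.stop()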

Both tutorials require the Senzing gRPC server Docker container (described under Requirements below) to be running whenever you run the Jupyter notebooks.

Start with the spark_quickstart.ipynb tutorial for detailed explanations of the core concepts.

Requirements

These tutorials were developed on macOS, though they should also run on Linux.

Platform requirements are:

  • Python 3.13
  • Java 17 or 21 (required by Spark)

You can find more details about the Java requirements in the Apache Spark documentation.

To set up the Python environment:

python3 -m venv venv
source venv/bin/activate

python3 -m pip install -U pip wheel
python3 -m pip install -r requirements.txt

Alternatively, use the requirements.txt file with uv, poetry, or your favorite Python dependency manager.

You also need to pull the latest Docker image for the Senzing gRPC server:

docker pull senzing/serve-grpc:latest

Then launch the container and leave it running, for example in a separate terminal:

docker run -it --publish 8261:8261 --rm senzing/serve-grpc

Then launch Jupyter to run the notebooks:

./venv/bin/jupyter-lab

Troubleshooting

Be aware that running pip install senzing without specifying a version will not give you the correct version for these tutorials; install the versions pinned in the requirements.txt file instead.
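
To confirm what is actually installed in the active environment, a quick check such as the following can help (the package names here are assumptions; compare against requirements.txt):

from importlib.metadata import PackageNotFoundError, version

# Print the installed versions of the Senzing client packages, if present.
for package in ("senzing", "senzing-grpc"):
    try:
        print(package, version(package))
    except PackageNotFoundError:
        print(package, "not installed")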

If you experience gRPC errors, they may be caused by an outdated Docker image; re-run the docker pull command above to update it, then restart the container.
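
A quick way to separate connectivity problems from other failures is a channel readiness check against the server started earlier (a sketch using only the grpc package; the port matches the docker run command):

import grpc

# Try to reach the Senzing gRPC server on the port published by the Docker container.
channel = grpc.insecure_channel("localhost:8261")
try:
    grpc.channel_ready_future(channel).result(timeout=5)
    print("Senzing gRPC server is reachable on localhost:8261")
except grpc.FutureTimeoutError:
    print("Cannot reach the Senzing gRPC server; is the Docker container running?")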

Kudos

Kudos to @brianmacy and @docktermj for their help with these tutorials.