data_quality_databricks

Examples of metadata driven SQL processes implemented in Databricks


Examples of data quality processes implemented in Databricks

This repository contains a collection of Databricks notebooks that demonstrate configurable data quality processes that can be implemented in Databricks using Python and SQL.

The processes detailed in this repository relate to data quality and data product management. They include methods for automating the maintenance of a data dictionary, refining a data model (comments and column positions), executing data quality tests, blocking bad quality data, and mapping values.
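The core idea behind a metadata-driven quality process is that the tests themselves are stored as data, and the SQL that executes them is generated from that metadata rather than hand-written per table. A minimal sketch of that pattern, in plain Python (the rule structure and names such as `dq_rules` are illustrative assumptions, not taken from the notebooks):

```python
# Hypothetical metadata-driven quality check: each rule records a target
# table and a SQL predicate that valid rows must satisfy. The rules are
# turned into COUNT queries over the rows that violate the predicate.

dq_rules = [
    {"table": "customers", "predicate": "email IS NOT NULL"},
    {"table": "orders", "predicate": "amount >= 0"},
]

def failing_rows_sql(rule):
    """Build a SQL statement that counts rows violating the rule."""
    return (
        f"SELECT COUNT(*) AS failed FROM {rule['table']} "
        f"WHERE NOT ({rule['predicate']})"
    )

for rule in dq_rules:
    # In a Databricks notebook each statement would run via spark.sql(...)
    print(failing_rows_sql(rule))
```

Adding a new test then means inserting a row into the rules table, with no change to the execution code.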

The repository contains an HTML version of each notebook that can be viewed in a browser, and a DBC archive that can be imported into a Databricks workspace. Execute Run All on the notebooks in their numbered order to reproduce the demo in your own workspace.

Notebooks

  1. Create sample data using Databricks data sets.
  2. Create data dictionary tables.
  3. Update data dictionaries using metastore data.
  4. Refine data model.
  5. Comment and reorder columns.
  6. Configuring data quality tests.
  7. Executing data quality tests.
  8. Blocking bad quality data.
  9. Mapping local values to global ones.
  10. Clean up (drop all tables created during demo).
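The value-mapping step above translates local source values into shared global reference values via a mapping table, so downstream consumers see one consistent vocabulary. A minimal sketch in plain Python, under the assumption that the mapping is keyed by domain and local value (the `value_map` structure and `UNMAPPED` marker are illustrative, not from the notebooks):

```python
# Hypothetical value mapping: (domain, local value) -> global value.
# In the notebooks this lookup would be a join against a mapping table;
# unmapped values are flagged rather than silently passed through.

value_map = {
    ("country", "UK"): "GBR",
    ("country", "U.S."): "USA",
}

def map_value(domain, local):
    """Return the global value for a local one, or a marker if unmapped."""
    return value_map.get((domain, local), "UNMAPPED")

print(map_value("country", "UK"))   # -> GBR
print(map_value("country", "DE"))   # -> UNMAPPED
```

Flagging unmapped values, rather than dropping them, makes gaps in the mapping table visible as a data quality signal in their own right.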