How to leverage the power of Databricks notebooks and GX data quality checks to create validated data workflows
Full instructions available at the Great Expectations Blog.
As this is a fairly detailed integration of multiple tools, some working knowledge of Python, SQL, Git, and Databricks is assumed. Prior experience with Great Expectations (GX) could prove useful, but is not strictly required. You will also need:

- A Databricks account and a Workspace set up on a supported cloud provider (AWS, Azure, GCP).
- A compute cluster running Databricks Runtime 12.0 or higher.
- A Git account with a Git provider supported by Databricks.
- Databricks Repos configured for Git integration.
- A Google Cloud Platform (GCP) account and the ability to create a project and generate a service account with API credentials.
The repository contains:

- Directories
- dotfiles (e.g. `.gitignore`)
- GCP Service Account credentials JSON file
- repo config YAML file
- Pandas DataFrame PKL file with some sample data
The directories hold:

- Databricks notebooks with executable code for scheduled orchestration
- Python files (not Databricks notebooks!) to be imported
- Anything pertaining to data validation with GX
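For orientation, a hypothetical layout consistent with the lists above might look like the following. The dotfile, credentials file, `config.yml`, and `great_expectations` names come from the default config shown below; the other names are illustrative, not the repo's actual structure:

```
/Workspace/Repos/dev/gx-databricks-bigquery-public/
├── .gitignore
├── .bigquery_service_account_creds.json
├── config.yml
├── sample_data.pkl       # Pandas DataFrame with sample data
├── notebooks/            # Databricks notebooks for scheduled orchestration
├── python/               # importable Python files, not notebooks
└── great_expectations/   # everything pertaining to data validation with GX
```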
The default contents of the `config.yml` file are shown below:
```yaml
# assumed directory structure is: /Workspace/repos/{repo_directory}/{repo_name}
# {repo_directory} in assumed directory structure
repo_directory: "dev"
# {repo_name} in assumed directory structure
repo_name: "gx-databricks-bigquery-public"
# relative path of BigQuery service account credentials file
bigquery_creds_file: ".bigquery_service_account_creds.json"
# relative path of great expectations directory
gx_dir: "great_expectations"
# provide a name to help identify the GX data connector type
gx_connector_name: "pandas_fluent"
```
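The repo's helper functions read these values at runtime. The exact helpers aren't reproduced here, but a minimal sketch of loading the config with PyYAML (which ships with the Databricks runtime) might look like this; the `load_repo_config` name and path-building logic are illustrative assumptions, not the repo's actual API:

```python
from pathlib import Path

import yaml


def load_repo_config(config_path: str = "config.yml") -> dict:
    """Hypothetical helper: read the repo's config.yml into a dict."""
    with open(config_path) as f:
        return yaml.safe_load(f)


config = load_repo_config()

# Build the absolute repo root from the assumed directory structure:
# /Workspace/Repos/{repo_directory}/{repo_name}
repo_root = Path("/Workspace/Repos") / config["repo_directory"] / config["repo_name"]

creds_path = repo_root / config["bigquery_creds_file"]
gx_root = repo_root / config["gx_dir"]
```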
To avoid instances of `FileNotFoundError` and other problems, it's suggested to:

- Not rename or move the `config.yml` file out of the top-level directory.
- Only use one `key: 'value'` pair per line.
- Maintain a directory structure of `/Workspace/Repos/{repo_directory}/{repo_name}`. If you deviate from this pattern, the helper functions in the repo may not be able to locate files correctly.
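If you do change the layout, a quick sanity check like the sketch below can fail fast with a clearer message than a `FileNotFoundError` raised deep inside a helper. This is an illustrative addition under the assumptions above, not code from the repo:

```python
from pathlib import Path


def check_repo_layout(config: dict) -> None:
    """Hypothetical sanity check: confirm the paths the helpers expect exist."""
    repo_root = Path("/Workspace/Repos") / config["repo_directory"] / config["repo_name"]
    expected = [
        repo_root / "config.yml",
        repo_root / config["bigquery_creds_file"],
        repo_root / config["gx_dir"],
    ]
    missing = [p for p in expected if not p.exists()]
    if missing:
        raise FileNotFoundError(
            f"Expected paths not found: {', '.join(str(p) for p in missing)}. "
            "Check repo_directory and repo_name in config.yml against your workspace."
        )
```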