Code for the blog: How to implement data quality checks with greatexpectations
Our project involves building a dim_customer
table from raw_customer
and raw_state
tables. We use sqlite3
as our data warehouse. The data flow architecture is pictured below:
flowchart TD
A[raw_state] -->|transform| B0(non_validated_base_state)
B0 --> B1{audit}
B1 --> |Pass|B[base_state]
B1 --> |Fail|A1[Raise Alert]
C[raw_customer] -->|transform| D0(non_validated_base_customer)
D0 --> D1{audit}
D1 --> |Pass|D[base_customer]
D1 --> |Fail|A11[Raise Alert]
B --> E0(non_validated_dim_customer)
D --> E0
E0 --> E1{audit}
E1 --> |Pass|F[dim_customer]
E1 --> |Fail|A12[Raise Alert]
F --> S[Success]
You can run this via
Simply click on this link and you will be able to run your own code space with this repository. Wait for a few minutes for codespaces to install all the packages in requirements.txt.
❗ Caution: Codespaces only have limited free availability
You can clone this repo and setup a virtual environment to run the code. You will need Python >= 3.10 and git installed.
git clone https://github.com/josephmachado/data-quality-w-greatexpectations.git
cd data-quality-w-greatexpectations
python -m venv ./env
source env/bin/activate # use virtual environment
pip install -r requirements.txt
Once you have the environment ready (either via codespaces or locally), run the setup script that creates tables, and inserts data into them.
# in the data-quality-w-greatexpectations folder
./setup.sh
With the tables setup, you can run the ETL with the following command:
# in the data-quality-w-greatexpectations folder
python ecommerce/ecommerce/dim_customer_etl.py
This will run the ETL along with the greatexpectations validations.
You can check that the ETL output is present with the following command.
sqlite3 ecommerce.db < ./setup/3-check-count.sql
The above query should return
110
110
27
The results of the validations will be stored in the ./ecommerce/ecommerce/gx/uncommitted/validations/dim_customer_dt_created_count/__none__
.
Since the uncommitted
folder is not included in the git repo, you will need to run the ETL atleast once for this folder to appear.