This project processes eCommerce behavior data for customer insights, following the Medallion Architecture (Bronze, Silver, Gold).
- dags/: Contains Airflow DAGs for orchestrating each pipeline stage.
- data/: Holds the Bronze, Silver, and Gold layer data (raw, processed, and aggregated, respectively).
- customer_behaviour/functions/: Core modules for each ETL stage:
  - bronze_ingestion.py: Ingestion functions for Bronze layer.
  - silver_transformation.py: Transformation functions for Silver layer.
  - gold_aggregations.py: Aggregations for analytics-ready data at Gold layer.
  - bigquery_loading.py: Loading functions to push data to BigQuery.
  - schema.py: Schema validation functions.
  - session.py: Session management functions (e.g., Spark session setup).
- customer_behaviour/jobs/: Contains jobs for each ETL stage (ingest, transform, aggregate).
- tests/: Unit and integration tests for pipeline components.
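
As a rough illustration of the kind of check schema.py might perform, here is a minimal sketch in plain Python. The column names, the `EXPECTED_SCHEMA` mapping, and the `validate_record` helper are all hypothetical; the project's actual schema module is not shown here and may well rely on Spark schema types instead.

```python
from typing import Any

# Hypothetical expected schema: column name -> Python type.
# These column names are illustrative assumptions, not taken from the project.
EXPECTED_SCHEMA: dict[str, type] = {
    "event_time": str,
    "event_type": str,
    "product_id": int,
    "user_id": int,
    "price": float,
}


def validate_record(record: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the record matches the schema."""
    problems = []
    for column, expected_type in schema.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            actual = type(record[column]).__name__
            problems.append(f"{column}: expected {expected_type.__name__}, got {actual}")
    return problems
```

Returning a list of problems rather than raising on the first mismatch makes it easier to report every issue in a bad record at once, which tends to be handy in unit tests.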
- Set Up Environment:
  make setup
  This installs dependencies using Poetry and sets up direnv.
- Run Tests:
  make test
  This runs unit and integration tests to ensure functionality.
- Lint Code:
  make lint
  Lints the code to check for syntax and style issues.
The pipeline follows the Medallion Architecture:
- Bronze Layer: Raw data ingestion via bronze_ingestion.py.
- Silver Layer: Data cleaning and transformation via silver_transformation.py.
- Gold Layer: Data aggregation for analytics via gold_aggregations.py.
- BigQuery Loading: Final loading into BigQuery for BI tools.
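
The layer-by-layer flow above can be sketched with a toy, in-memory example. This is not the project's code: the function names, the field names (`user_id`, `event_type`, `price`), and the revenue-per-user aggregate are assumptions chosen for illustration, and plain Python lists stand in for the Spark DataFrames the real modules presumably use. The BigQuery loading step is left out.

```python
from collections import defaultdict


def ingest_bronze(raw_rows):
    """Bronze: keep rows as received, only tagging them with their origin."""
    return [{**row, "_source": "raw"} for row in raw_rows]


def transform_silver(bronze_rows):
    """Silver: drop rows without a user_id and normalise types and casing."""
    silver = []
    for row in bronze_rows:
        if row.get("user_id") is None:
            continue  # unusable without a user to attribute the event to
        silver.append(
            {
                "user_id": int(row["user_id"]),
                "event_type": str(row["event_type"]).lower(),
                "price": float(row.get("price") or 0.0),
            }
        )
    return silver


def aggregate_gold(silver_rows):
    """Gold: total purchase revenue per user (a hypothetical analytics table)."""
    revenue = defaultdict(float)
    for row in silver_rows:
        if row["event_type"] == "purchase":
            revenue[row["user_id"]] += row["price"]
    return dict(revenue)


# Hypothetical sample events, invented for this sketch.
raw = [
    {"user_id": "1", "event_type": "PURCHASE", "price": "10.5"},
    {"user_id": None, "event_type": "view", "price": "3.0"},
    {"user_id": "1", "event_type": "purchase", "price": "4.5"},
    {"user_id": "2", "event_type": "view", "price": None},
]
gold = aggregate_gold(transform_silver(ingest_bronze(raw)))
print(gold)  # {1: 15.0}
```

Each stage only consumes the output of the previous one, which is the property that lets an orchestrator run and retry the stages independently.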
Build and run the container with Docker Compose:
docker-compose up --build