A Scalable Data Platform Implementation
This project is an implementation of a scalable, near real-time data platform built on top of the Open Data Platform Mobility Switzerland (ODPMS) API.
Key components are integrated to ensure efficient data ingestion, transformation, and visualization. The pipeline starts by extracting data from the API and loading it into Snowflake with DLT. Transformations are then carried out with dbt to build the final data models, which are served to PowerBI dashboards for near real-time visualization. The entire ELT process is orchestrated by Airflow running on an AKS cluster. Infrastructure is managed as code with Terraform, and CI/CD is handled by pre-commit hooks and GitHub Actions workflows.
Here are a few use cases implemented to demonstrate the practical applications of the data platform.
A Machine Learning model that aims to answer the question: "When should I leave to arrive at my destination before a specific time with a given probability, given the recent state of the transit network?"
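The core of that question is a quantile computation over predicted travel times. A minimal sketch of the idea, assuming the model produces a set of sampled or predicted door-to-door travel times conditioned on the recent network state (the function name, field values, and sample data below are illustrative, not taken from the project):

```python
import math
from datetime import datetime, timedelta

def latest_departure(arrive_by: datetime, travel_times_min: list[float],
                     probability: float) -> datetime:
    """Latest departure such that P(arrival <= arrive_by) >= probability,
    given sampled/predicted door-to-door travel times in minutes."""
    times = sorted(travel_times_min)
    # Empirical quantile: the smallest travel time that covers
    # `probability` of the predicted samples.
    idx = min(len(times) - 1, max(0, math.ceil(probability * len(times)) - 1))
    return arrive_by - timedelta(minutes=times[idx])

# Simulated travel-time predictions (minutes) for the current network state.
samples = [22, 25, 24, 30, 28, 26, 35, 27, 23, 29]
leave_at = latest_departure(datetime(2024, 1, 1, 9, 0), samples, 0.9)
# With these samples, the 90% quantile is 30 min, so leave by 08:30.
```

The real model would replace the fixed sample list with a predictive distribution; the quantile logic stays the same.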
| Feature | Description |
|---|---|
| Modular Architecture | The project is organized into distinct modules: extract_load, transform, orchestrate, infrastructure, and CICD, each handling a specific part of the data pipeline. |
| Data Ingestion | Utilizes the DLT (Data Load Tool) Python library for robust data extraction and loading from the ODPMS API to Snowflake. |
| Data Transformation | Uses dbt for data modeling and transformation within Snowflake, ensuring data integrity and consistency. |
| ELT Orchestration | Apache Airflow running on an AKS cluster manages and automates the ELT processes, ensuring reliability and scalability. |
| Infrastructure Management | Infrastructure as code is handled with Terraform, allowing for easy deployment, updates, and scaling. |
| CI/CD Pipeline | Implemented with pre-commit hooks and GitHub Actions workflows for continuous integration and delivery. |
| Data Visualization | Final data models are served to PowerBI dashboards, providing insightful visualizations of the transit data. |
| Near Real-Time Refresh | Data in the dashboards is refreshed every 15 minutes, ensuring up-to-date insights. |
| Scalability and Flexibility | The project architecture supports scalable and flexible data processing, making it adaptable to varying data volumes and requirements. |
| Environment Management | Different environments (dev, prod) are easily managed and configured using environment-specific tfvars files. |
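The 15-minute near real-time refresh implies incremental loads rather than full reloads. A minimal sketch of the cursor-based pattern, which dlt provides natively via `dlt.sources.incremental` (the `recorded_at` field name and the sample records are illustrative):

```python
def incremental_batch(records, last_cursor):
    """Keep only records newer than last_cursor and advance the cursor.

    `recorded_at` is an illustrative field name; in dlt this pattern is
    handled by dlt.sources.incremental rather than hand-written code.
    """
    fresh = [r for r in records if r["recorded_at"] > last_cursor]
    new_cursor = max((r["recorded_at"] for r in fresh), default=last_cursor)
    return fresh, new_cursor

batch = [
    {"trip_id": "t1", "recorded_at": "2024-01-01T08:00:00Z"},
    {"trip_id": "t2", "recorded_at": "2024-01-01T08:15:00Z"},
]
fresh, cursor = incremental_batch(batch, "2024-01-01T08:00:00Z")
# fresh contains only t2; the cursor advances to 2024-01-01T08:15:00Z
```

Persisting the cursor between runs is what keeps each 15-minute load small and idempotent.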
e2e_de_project/
├── .github
│   ├── actions
│   └── workflows
├── README.md
├── extract_load
│   ├── README.md
│   ├── data_source.yaml
│   └── dlt_project
│       ├── Dockerfile
│       ├── pipeline_gtfs.py
│       ├── pipeline_gtfs_rt.py
│       ├── requirements.txt
│       ├── sources
│       │   ├── __init__.py
│       │   ├── helpers.py
│       │   ├── schemas.py
│       │   └── settings.py
│       └── tests
├── infrastructure
│   ├── README.md
│   ├── 00_bootstrap
│   ├── 01_vpc
│   ├── 02_aks
│   ├── 03_snowflake
│   ├── 04_airflow
│   ├── Makefile
│   ├── global.tfvars
│   ├── infracost.sh
│   ├── terraform.sh
│   └── tests
├── orchestrate
│   ├── README.md
│   ├── dags
│   ├── docker
│   └── tests
├── transform
│   ├── README.md
│   ├── snowflake-dbt
│   └── tests
├── .pre-commit-config.yaml
├── .tflint.hcl
├── .yamllint
├── .sqlfluff
└── .hadolint.yaml
The project is organized as a monorepo, with each module developed and maintained independently. This approach ensures clear separation of concerns and ease of integration, enhancing maintainability, scalability, and collaboration. Each module handles a distinct part of the data pipeline. Below is an overview of each module.
Overview: The extract_load module provides robust pipelines for extracting and loading GTFS and GTFS Real-Time data from the ODPMS API using DLT.

Files:

- `pipeline_gtfs.py`: Script for loading GTFS data.
- `pipeline_gtfs_rt.py`: Script for loading GTFS Real-Time data.
- `Dockerfile`: Dockerfile for setting up the environment.
- `requirements.txt`: List of dependencies.
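To illustrate the kind of work the `sources` helpers might do, here is a sketch that flattens a GTFS-RT TripUpdate feed (JSON encoding; the camelCase field names follow the GTFS Realtime spec) into rows suitable for a warehouse table. The function name and output column names are illustrative, not the project's actual code:

```python
def flatten_trip_updates(feed: dict) -> list[dict]:
    """Flatten a GTFS-RT TripUpdate feed (JSON encoding) into one row per
    stop-time update, ready for loading into a wide table."""
    rows = []
    for entity in feed.get("entity", []):
        trip_update = entity.get("tripUpdate")
        if not trip_update:
            continue
        trip_id = trip_update.get("trip", {}).get("tripId")
        for stu in trip_update.get("stopTimeUpdate", []):
            rows.append({
                "trip_id": trip_id,
                "stop_id": stu.get("stopId"),
                "arrival_delay_s": stu.get("arrival", {}).get("delay"),
            })
    return rows

feed = {"entity": [{"id": "1", "tripUpdate": {
    "trip": {"tripId": "t1"},
    "stopTimeUpdate": [{"stopId": "s1", "arrival": {"delay": 60}}],
}}]}
rows = flatten_trip_updates(feed)
# rows == [{"trip_id": "t1", "stop_id": "s1", "arrival_delay_s": 60}]
```

In the real pipeline, DLT would take such rows (or the nested feed directly) and handle schema inference and loading into Snowflake.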
Overview: The transform module leverages dbt and Snowflake for transforming and modeling data.
DBT Docs: link
Files:

- `dbt_project.yml`: Configuration for the dbt project.
- `models/`: Directory containing dbt models for the staging, intermediate, and marts layers.
- `Dockerfile`: Dockerfile for setting up the environment.
- `requirements.txt`: List of dependencies.
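As an illustration of the staging layer mentioned above, a dbt staging model could look like the following sketch. The source name, table name, and columns are invented for the example, not the project's actual schema:

```sql
-- models/staging/stg_trip_updates.sql (hypothetical example)
with source as (
    select * from {{ source('odpms_raw', 'trip_updates') }}
)

select
    trip_id,
    stop_id,
    arrival_delay_s,
    loaded_at
from source
```

Staging models like this typically only rename, cast, and lightly clean raw columns; business logic lives in the intermediate and marts layers.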
Overview: The orchestrate module uses Apache Airflow to manage and automate the ELT pipelines.

Files:

- `dags/`: Directory containing Airflow DAGs for ELT processes.
- `docker/`: Docker configurations for different ELT tasks.
- `tests/`: Directory containing tests for the DAGs and Docker images.
Overview: The infrastructure module uses Terraform to automate the setup of cloud environments, organized in a unit-based structure.
Files:

- `main.tf`: Terraform configuration file for setting up the infrastructure.
- `tfvars/`: Directory containing environment-specific variable files.
- `Makefile`: Commands to manage infrastructure deployment.
- `global.tfvars`: Global variables for Terraform configurations.
- `terraform.sh`: Script to streamline Terraform operations.
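For example, an environment-specific tfvars file for `dev` might look like the sketch below. The variable names and values are hypothetical; the actual variables are defined in the module's Terraform code:

```hcl
# tfvars/dev.tfvars (hypothetical example)
environment      = "dev"
location         = "switzerlandnorth"
aks_node_count   = 2
aks_node_vm_size = "Standard_D2s_v3"
```

Keeping per-environment values in tfvars files lets the same Terraform units deploy dev and prod with different sizing.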
Ensure you have the following accounts and tools set up before getting started:
- Accounts:
- Installed Tools:
  - Azure CLI (`az`): For managing Azure resources.
  - SnowSQL: For interacting with Snowflake.
  - Terraform CLI: For infrastructure management.
  - GitHub CLI: For GitHub operations.
  - Kubernetes CLI (`kubectl`): For interacting with Kubernetes clusters.
  - Helm: For managing Kubernetes applications.
  - jq: For processing JSON in bash.
Before initializing the project, fill in the `infrastructure/00_bootstrap/config.env` file with the required platform credentials. This file is used only once to bootstrap the infrastructure and can be removed afterward.
# Azure
TENANT_ID=""
# Snowflake
SNOWFLAKE_ACCOUNT_NAME="" # Your Snowflake account name for SnowSQL, for details: https://docs.snowflake.com/en/user-guide/admin-account-identifier
SNOWFLAKE_ACCOUNT_NAME_LOCATOR="" # Your Snowflake account name for DBT, for details: https://docs.getdbt.com/docs/core/connect-data-platform/snowflake-setup#account
SNOWFLAKE_USERNAME="" # Your Snowflake account username
SNOWSQL_PWD="" # Your Snowflake account password
# GitHub
REPO="" # The repo of your cloned project in format [HOST/]OWNER/REPO
SERVICE_PRINCIPAL_NAME="TF_USER_GITHUB" # Name of the Azure and Snowflake service principals for GitHub Actions
# Project
PROJECT_NAME=""
# Open data platform mobility Switzerland API token
ODPMS_TOKEN="" # Register and get a free token at https://opentransportdata.swiss/en/log-in/
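Since an empty value here only surfaces later as a confusing bootstrap failure, a small helper can check the file first. This is a sketch, not part of the project: the function name is illustrative, and the grep pattern assumes the simple `KEY="value"` format shown above (trailing comments are fine):

```shell
# List config.env variables that were left empty before bootstrapping.
check_env_file() {
  grep -E '^[A-Z_]+=""' "$1" | cut -d= -f1
}

# Demo against a sample file; real usage would be:
#   check_env_file infrastructure/00_bootstrap/config.env
printf 'TENANT_ID="abc"\nPROJECT_NAME=""\n' > /tmp/sample_config.env
check_env_file /tmp/sample_config.env   # prints: PROJECT_NAME
```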
Bootstrap the project by executing the following make command. This command performs two main tasks:

1. Creates the Azure resources needed for Terraform remote state storage, including resource groups, storage accounts, and blob containers for the different environments.
2. Sets up GitHub secrets for secure CI/CD operations: it logs into Azure, creates a service principal, assigns the required roles, and stores the Azure and Snowflake credentials as GitHub secrets.
cd infrastructure
make infra-bootstrap
With everything set up, execute the following `make` command to build the entire infrastructure in a single run. For more details about the deployed infrastructure, refer to the infrastructure README:
make infra-up ENV=dev EXEC=local
If the build process completes successfully, the infrastructure should be up and running. Verify the status of the pods using the following `kubectl` command:
kubectl get pods -A
If all pods, particularly those in the "airflow" namespace, are in the "Running" status, the setup is successful. Note that it might take a few minutes for all Airflow pods to be fully instantiated.
Once the infrastructure is up, the final step is to build the container images that Airflow will use to instantiate worker pods for the ELT process. The Docker images are defined in the `orchestrate/docker` directory. You need to build and push these images to the Azure Container Registry.
You can either build the images one by one or use the following script to build all images:
cd infrastructure/04_airflow/deployment
./build_and_push_images.sh dev
Once the infrastructure is up and the Docker images are pushed to the container registry, you can connect to the Airflow UI to monitor the execution of ELT processes, view logs, and manage workflows. Use the following command to set up port forwarding, which allows you to access the Airflow web server from your local machine:
kubectl port-forward svc/airflow-webserver 8050:8080 --namespace airflow
Once the data is processed, it will be available for downstream reporting and visualization.
To run the pre-commit hooks and tests, you need to install several Bash packages and Python libraries. Follow the instructions below:
Install the necessary Python libraries by running the following command:
pip install -r requirements-dev.txt
Install the required linters and test frameworks; refer to the `.pre-commit-config.yaml` file for the specific tools and versions to install.
# Extract-Load Module: Ensure the "extract_load_image" Docker image is built, then run tests
pytest extract_load/dlt_project/tests
# Transform Module: Ensure the "transform_image" Docker image is built, then run tests
pytest transform/tests
# Orchestrate Module: Run tests for DAGs
pytest orchestrate/tests/dags
# Orchestrate Module: Ensure the "check_update_image" Docker image is built, then run Docker-related tests
pytest orchestrate/tests/docker
# Infrastructure Module: Run tests for helper Bash scripts
cd infrastructure/tests
bats .
- Use Postgres as the Airflow DB backend
- Enhance logging and monitoring features
- Enhance security with additional policies and auditing
- Use Case: Advanced Analytics
- Use Apache Iceberg tables with Snowflake