A Scalable Data Platform Implementation
This project is an implementation of a scalable, near real-time data platform built on top of the Open Data Platform Mobility Switzerland (ODPMS) API.
Key components are integrated to ensure efficient data ingestion, transformation, and visualization. The pipeline starts by extracting data from the API and loading it into Snowflake with DLT. Transformations are then carried out with dbt to build the final data models, which are served to PowerBI dashboards for near real-time visualization. The entire ELT process is orchestrated by Airflow running on an AKS cluster. Infrastructure is managed as code with Terraform, and CI/CD is handled by pre-commit hooks and GitHub Actions workflows.
Here are a few use cases implemented to demonstrate the practical applications of the data platform.
A Machine Learning model that aims to answer the question: "When should I leave to arrive at my destination before a specific time with a given probability, given the recent state of the transit network?"
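The core of that question is a quantile computation over predicted travel times. A minimal sketch of the idea, assuming the model produces a set of sampled or predicted door-to-door travel times conditioned on the recent network state (the function name, field values, and sample data below are illustrative, not taken from the project):

```python
import math
from datetime import datetime, timedelta

def latest_departure(arrive_by: datetime, travel_times_min: list[float],
                     probability: float) -> datetime:
    """Latest departure such that P(arrival <= arrive_by) >= probability,
    given sampled/predicted door-to-door travel times in minutes."""
    times = sorted(travel_times_min)
    # Empirical quantile: the smallest travel time that covers
    # `probability` of the predicted samples.
    idx = min(len(times) - 1, max(0, math.ceil(probability * len(times)) - 1))
    return arrive_by - timedelta(minutes=times[idx])

# Simulated travel-time predictions (minutes) for the current network state.
samples = [22, 25, 24, 30, 28, 26, 35, 27, 23, 29]
leave_at = latest_departure(datetime(2024, 1, 1, 9, 0), samples, 0.9)
# With these samples, the 90% quantile is 30 min, so leave by 08:30.
```

The real model would replace the fixed sample list with a predictive distribution; the quantile logic stays the same.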
| Feature | Description |
|---|---|
| Modular Architecture | The project is organized into distinct modules: extract_load, transform, orchestrate, infrastructure, and CICD, each handling a specific part of the data pipeline. |
| Data Ingestion | Utilizes the DLT (Data Load Tool) Python library for robust data extraction and loading from the ODPMS API to Snowflake. |
| Data Transformation | Uses dbt for data modeling and transformation within Snowflake, ensuring data integrity and consistency. |
| ELT Orchestration | Apache Airflow running on an AKS cluster manages and automates the ELT processes, ensuring reliability and scalability. |
| Infrastructure Management | Infrastructure as code is handled with Terraform, allowing for easy deployment, updates, and scaling. |
| CI/CD Pipeline | Implemented with pre-commit hooks and GitHub Actions workflows for continuous integration and delivery. |
| Data Visualization | Final data models are served to PowerBI dashboards, providing insightful visualizations of the transit data. |
| Near Real-Time Refresh | Data in the dashboards is refreshed every 15 minutes, ensuring up-to-date insights. |
| Scalability and Flexibility | The project architecture supports scalable and flexible data processing, making it adaptable to varying data volumes and requirements. |
| Environment Management | Different environments (dev, prod) are easily managed and configured using environment-specific tfvars files. |
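The 15-minute near real-time refresh implies incremental loads rather than full reloads. A minimal sketch of the cursor-based pattern, which dlt provides natively via `dlt.sources.incremental` (the `recorded_at` field name and the sample records are illustrative):

```python
def incremental_batch(records, last_cursor):
    """Keep only records newer than last_cursor and advance the cursor.

    `recorded_at` is an illustrative field name; in dlt this pattern is
    handled by dlt.sources.incremental rather than hand-written code.
    """
    fresh = [r for r in records if r["recorded_at"] > last_cursor]
    new_cursor = max((r["recorded_at"] for r in fresh), default=last_cursor)
    return fresh, new_cursor

batch = [
    {"trip_id": "t1", "recorded_at": "2024-01-01T08:00:00Z"},
    {"trip_id": "t2", "recorded_at": "2024-01-01T08:15:00Z"},
]
fresh, cursor = incremental_batch(batch, "2024-01-01T08:00:00Z")
# fresh contains only t2; the cursor advances to 2024-01-01T08:15:00Z
```

Persisting the cursor between runs is what keeps each 15-minute load small and idempotent.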
e2e_de_project/
├── .github
│   ├── actions
│   └── workflows
├── README.md
├── extract_load
│   ├── README.md
│   ├── data_source.yaml
│   └── dlt_project
│       ├── Dockerfile
│       ├── pipeline_gtfs.py
│       ├── pipeline_gtfs_rt.py
│       ├── requirements.txt
│       ├── sources
│       │   ├── __init__.py
│       │   ├── helpers.py
│       │   ├── schemas.py
│       │   └── settings.py
│       └── tests
├── infrastructure
│   ├── README.md
│   ├── 00_bootstrap
│   ├── 01_vpc
│   ├── 02_aks
│   ├── 03_snowflake
│   ├── 04_airflow
│   ├── Makefile
│   ├── global.tfvars
│   ├── infracost.sh
│   ├── terraform.sh
│   └── tests
├── orchestrate
│   ├── README.md
│   ├── dags
│   ├── docker
│   └── tests
├── transform
│   ├── README.md
│   ├── snowflake-dbt
│   └── tests
├── .pre-commit-config.yaml
├── .tflint.hcl
├── .yamllint
├── .sqlfluff
└── .hadolint.yaml
The project is organized as a monorepo, with each module developed and maintained independently. This approach ensures clear separation of concerns and ease of integration, enhancing maintainability, scalability, and collaboration. Each module handles a distinct part of the data pipeline. Below is an overview of each module.
Overview: The extract_load module provides robust pipelines for extracting and loading GTFS and GTFS Real-Time data from the ODPMS API using DLT.

Files:

- `pipeline_gtfs.py`: Script for loading GTFS data.
- `pipeline_gtfs_rt.py`: Script for loading GTFS Real-Time data.
- `Dockerfile`: Dockerfile for setting up the environment.
- `requirements.txt`: List of dependencies.
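To illustrate the kind of work the `sources` helpers might do, here is a sketch that flattens a GTFS-RT TripUpdate feed (JSON encoding; the camelCase field names follow the GTFS Realtime spec) into rows suitable for a warehouse table. The function name and output column names are illustrative, not the project's actual code:

```python
def flatten_trip_updates(feed: dict) -> list[dict]:
    """Flatten a GTFS-RT TripUpdate feed (JSON encoding) into one row per
    stop-time update, ready for loading into a wide table."""
    rows = []
    for entity in feed.get("entity", []):
        trip_update = entity.get("tripUpdate")
        if not trip_update:
            continue
        trip_id = trip_update.get("trip", {}).get("tripId")
        for stu in trip_update.get("stopTimeUpdate", []):
            rows.append({
                "trip_id": trip_id,
                "stop_id": stu.get("stopId"),
                "arrival_delay_s": stu.get("arrival", {}).get("delay"),
            })
    return rows

feed = {"entity": [{"id": "1", "tripUpdate": {
    "trip": {"tripId": "t1"},
    "stopTimeUpdate": [{"stopId": "s1", "arrival": {"delay": 60}}],
}}]}
rows = flatten_trip_updates(feed)
# rows == [{"trip_id": "t1", "stop_id": "s1", "arrival_delay_s": 60}]
```

In the real pipeline, DLT would take such rows (or the nested feed directly) and handle schema inference and loading into Snowflake.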
Overview: The transform module leverages dbt and Snowflake for transforming and modeling data.
DBT Docs: link
Files:

- `dbt_project.yml`: Configuration for the dbt project.
- `models/`: Directory containing dbt models for the staging, intermediate, and marts layers.
- `Dockerfile`: Dockerfile for setting up the environment.
- `requirements.txt`: List of dependencies.
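As an illustration of the staging layer mentioned above, a dbt staging model could look like the following sketch. The source name, table name, and columns are invented for the example, not the project's actual schema:

```sql
-- models/staging/stg_trip_updates.sql (hypothetical example)
with source as (
    select * from {{ source('odpms_raw', 'trip_updates') }}
)

select
    trip_id,
    stop_id,
    arrival_delay_s,
    loaded_at
from source
```

Staging models like this typically only rename, cast, and lightly clean raw columns; business logic lives in the intermediate and marts layers.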
Overview: The orchestrate module uses Apache Airflow to manage and automate the ELT pipelines.

Files:

- `dags/`: Directory containing Airflow DAGs for ELT processes.
- `docker/`: Docker configurations for different ELT tasks.
- `tests/`: Directory containing tests for the DAGs and Docker images.
Overview: The infrastructure module uses Terraform to automate the setup of cloud environments, organized in a unit-based structure.
Files:

- `main.tf`: Terraform configuration file for setting up the infrastructure.
- `tfvars/`: Directory containing environment-specific variable files.
- `Makefile`: Commands to manage infrastructure deployment.
- `global.tfvars`: Global variables for Terraform configurations.
- `terraform.sh`: Script to streamline Terraform operations.
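For example, an environment-specific tfvars file for `dev` might look like the sketch below. The variable names and values are hypothetical; the actual variables are defined in the module's Terraform code:

```hcl
# tfvars/dev.tfvars (hypothetical example)
environment      = "dev"
location         = "switzerlandnorth"
aks_node_count   = 2
aks_node_vm_size = "Standard_D2s_v3"
```

Keeping per-environment values in tfvars files lets the same Terraform units deploy dev and prod with different sizing.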
Ensure you have the following accounts and tools set up before getting started:
- Accounts:
- Installed Tools:
  - Azure CLI (`az`): For managing Azure resources.
  - SnowSQL: For interacting with Snowflake.
  - Terraform CLI: For infrastructure management.
  - GitHub CLI: For GitHub operations.
  - Kubernetes CLI (`kubectl`): For interacting with Kubernetes clusters.
  - Helm: For managing Kubernetes applications.
  - jq: For processing JSON in bash.
Before initializing the project, fill in the `infrastructure/00_bootstrap/config.env` file with the required platform credentials. This file is used only once to bootstrap the infrastructure and can be removed afterward.
# Azure
TENANT_ID=""
# Snowflake
SNOWFLAKE_ACCOUNT_NAME="" # Your Snowflake account name for SnowSQL, for details: https://docs.snowflake.com/en/user-guide/admin-account-identifier
SNOWFLAKE_ACCOUNT_NAME_LOCATOR="" # Your Snowflake account name for DBT, for details: https://docs.getdbt.com/docs/core/connect-data-platform/snowflake-setup#account
SNOWFLAKE_USERNAME="" # Your Snowflake account username
SNOWSQL_PWD="" # Your Snowflake account password
# GitHub
REPO="" # The repo of your cloned project in format [HOST/]OWNER/REPO
SERVICE_PRINCIPAL_NAME="TF_USER_GITHUB" # Name of the Azure and Snowflake service principals for GitHub Actions
# Project
PROJECT_NAME=""
# Open data platform mobility Switzerland API token
ODPMS_TOKEN="" # Register and get a free token at https://opentransportdata.swiss/en/log-in/
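Since an empty value here only surfaces later as a confusing bootstrap failure, a small helper can check the file first. This is a sketch, not part of the project: the function name is illustrative, and the grep pattern assumes the simple `KEY="value"` format shown above (trailing comments are fine):

```shell
# List config.env variables that were left empty before bootstrapping.
check_env_file() {
  grep -E '^[A-Z_]+=""' "$1" | cut -d= -f1
}

# Demo against a sample file; real usage would be:
#   check_env_file infrastructure/00_bootstrap/config.env
printf 'TENANT_ID="abc"\nPROJECT_NAME=""\n' > /tmp/sample_config.env
check_env_file /tmp/sample_config.env   # prints: PROJECT_NAME
```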
Bootstrap the project by executing the following make command. This command performs two main tasks:

1. Creates the Azure resources needed for Terraform remote state storage, including resource groups, storage accounts, and blob containers for the different environments.
2. Sets up GitHub secrets for secure CI/CD operations: it logs into Azure, creates a service principal, assigns the required roles, and stores the Azure and Snowflake credentials as GitHub secrets.
cd infrastructure
make infra-bootstrap
With everything set up, execute the following `make` command to build the entire infrastructure in a single run. For more details about the deployed infrastructure, refer to the infrastructure README:
make infra-up ENV=dev EXEC=local
If the build process completes successfully, the infrastructure should be up and running. Verify the status of the pods using the following `kubectl` command:
kubectl get pods -A
If all pods, particularly those in the "airflow" namespace, are in the "Running" status, the setup is successful. Note that it might take a few minutes for all Airflow pods to be fully instantiated.
Once the infrastructure is up, the final step is to build the container images that Airflow will use to instantiate worker pods for the ELT process. The Docker images are defined in the `orchestrate/docker` directory. You need to build and push these images to the Azure Container Registry.
You can either build the images one by one or use the following script to build all images:
cd infrastructure/04_airflow/deployment
./build_and_push_images.sh dev
Once the infrastructure is up and the Docker images are pushed to the container registry, you can connect to the Airflow UI to monitor the execution of ELT processes, view logs, and manage workflows. Use the following command to set up port forwarding, which allows you to access the Airflow web server from your local machine:
kubectl port-forward svc/airflow-webserver 8050:8080 --namespace airflow
Once the data is processed, it will be available for downstream reporting and visualization.
To run the pre-commit hooks and tests, you need to install several Bash packages and Python libraries. Follow the instructions below:
Install the necessary Python libraries by running the following command:
pip install -r requirements-dev.txt
Install the required linters and test frameworks; refer to the `.pre-commit-config.yaml` file for the specific tools and versions to install.
# Extract-Load Module: Ensure the "extract_load_image" Docker image is built, then run tests
pytest extract_load/dlt_project/tests
# Transform Module: Ensure the "transform_image" Docker image is built, then run tests
pytest transform/tests
# Orchestrate Module: Run tests for DAGs
pytest orchestrate/tests/dags
# Orchestrate Module: Ensure the "check_update_image" Docker image is built, then run Docker-related tests
pytest orchestrate/tests/docker
# Infrastructure Module: Run tests for helper Bash scripts
cd infrastructure/tests
bats .
- Use Postgres as the Airflow DB backend
- Enhance logging and monitoring features
- Enhance security with additional policies and auditing
- Use Case: Advanced Analytics
- Use Apache Iceberg tables with Snowflake