Analytics of Train Delays of UK Rail

Preface

This repository contains the final project for the Data Engineering Zoomcamp - a free course organized by the DataTalks.Club community. Course tutors: Victoria Perez Mola, Sejal Vaidya, Ankush Khanna, Alexey Grigorev. The project covers the main data engineering skills taught in the course: infrastructure as code, containerization, data lake, ETL/ELT, pipeline orchestration and scheduling, batch processing, data warehousing, and reporting.


Introduction

The intention was to work with a data set consisting of multiple files, each with at least thousands of records, so that a monthly load could be simulated with enough data to justify ELT over ETL. After a thorough review of the available free data sets, the UK passenger train delays data set from Network Rail was chosen: https://www.networkrail.co.uk/who-we-are/transparency-and-ethics/transparency/open-data-feeds/


Dataset description

The dataset contains two types of data: zipped CSV files with delays in minutes and the corresponding attribute codes, and an XLSX file with the attribute glossary.


Historic delay files

Comma-separated files containing attribute codes (strings and integers), date/datetime attributes, and delay values (float). Each file covers delays over a 28/29-day span, leading to 13 files per reporting year: e.g. 2020/21 covers 13 periods starting on 1 April 2020 (encoded as 2020/21_P1) and ending on 31 March 2021 (encoded as 2020/21_P13).

Key attributes:

  • EVENT_DATETIME - date and time when the event leading to the delay happened. Formatted either as DD-MMM-YYYY HH:mm (e.g. 11-NOV-2018 17:53) or DD/MM/YYYY HH:mm (e.g. 08/12/2019 13:43).
  • INCIDENT_NUMBER - number of the incident (forms unique identifier together with EVENT_DATETIME)
  • PERFORMANCE_EVENT_CODE - whether the train has been delayed or cancelled: A and M denote delays, C is a full cancellation, D is a diversion, F is a failure to stop, S is a scheduled cancellation, and O/P are part cancellations
  • START_STANOX/END_STANOX - the location of the delay (not the incident)
  • RESPONSIBLE_MANAGER - who within the industry is responsible for the delay
  • OPERATOR_AFFECTED - code of the company that operates the delayed train
  • INCIDENT_REASON - the Delay Attribution Guide cause code for the incident

IMPORTANT: the schema and the date/datetime representation of the delay files change between periods.
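
For illustration, here is a minimal Python sketch of how the two date/datetime formats listed above could be normalised during ingestion; the helper name and format list are assumptions and are not part of the pipeline code.

```python
from datetime import datetime

# Hypothetical helper, not part of the repository: normalises the two
# EVENT_DATETIME formats observed in the delay files.
# Note: %b (abbreviated month name) assumes an English locale.
EVENT_DATETIME_FORMATS = ("%d-%b-%Y %H:%M", "%d/%m/%Y %H:%M")

def parse_event_datetime(raw: str) -> datetime:
    for fmt in EVENT_DATETIME_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised EVENT_DATETIME value: {raw!r}")

# Examples taken from the attribute description above:
parse_event_datetime("11-NOV-2018 17:53")  # DD-MMM-YYYY HH:mm
parse_event_datetime("08/12/2019 13:43")   # DD/MM/YYYY HH:mm
```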


Historic-Delay-Attribution-Glossary

An XLSX file containing sheets with the attribute glossary:

  • Stanox Codes
  • Period Dates
  • Incident Reason
  • Responsible Manager
  • Reactionary Reason Code
  • Performance Event Code
  • Service Group Code
  • Operator Name
  • Train Service Code

Objective and requirements

Develop an end-to-end data solution to perform advanced analysis of the train delays. The solution must meet the following requirements:

  • Infrastructure provided via code (IaC approach)
  • Cloud storage - data lake and data warehouse
  • Create separate pipelines for historical data and attributes: the historical data pipeline must run every month, while the attribute file ingestion pipeline is triggered ad hoc.
  • Data transformations must be performed using software engineering principles (data documentation, testing) and in a way that data analysts can easily understand and modify transformations.
  • Create an interactive report.

Solution

solution pipeline

Technology stack

  • Data ingestion pipeline: Apache Airflow
  • Data stores: Google Cloud Storage (data lake), Google BigQuery (DWH)
  • Data transformation: dbt cloud
  • Reporting: Google Data Studio

Infrastructure as code

Google Cloud infrastructure is provisioned with Terraform following the IaC approach. The Google Cloud Storage bucket, the BigQuery dataset, and the landing tables for historical and attribute data are defined in 01-terraform/main.tf.

Data pipeline orchestration and scheduling

Initial ingestion of the historical data is done using dockerized Apache Airflow running on a Linux host. Files stored in *.zip format on a web server are saved into a local folder, unzipped, and converted into *.parquet format, forcing the STRING data type on all columns. The latter is done to speed up the load process and to avoid type errors while loading into BigQuery tables. The parquet files are transferred to Google Cloud Storage and a temporary external table is created. The table's content is appended to the landing table with an additional insert_datetime column for later version control. Finally, the local folder is cleaned up. Important: because the name and order of the columns in the historical data files change over time, different SQL queries are used to append data from different time periods.
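
A minimal sketch of the unzip-and-convert step described above, assuming pandas, pyarrow and requests are available; the function name, URL handling and paths are hypothetical and shown only to illustrate forcing STRING columns:

```python
import io
import zipfile
from pathlib import Path

import pandas as pd
import requests

# Hypothetical helper for illustration only; the real DAG tasks may be
# structured differently.
def zip_csv_to_parquet(url: str, local_dir: str) -> list[str]:
    """Download a zipped delay file and convert each CSV inside to parquet."""
    response = requests.get(url, timeout=300)
    response.raise_for_status()
    parquet_files = []
    with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as csv_file:
                # dtype=str forces STRING on every column, which avoids type
                # errors when the data is later loaded into BigQuery tables
                df = pd.read_csv(csv_file, dtype=str)
            out_path = Path(local_dir) / (Path(name).stem + ".parquet")
            df.to_parquet(out_path, index=False)
            parquet_files.append(str(out_path))
    return parquet_files
```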

The pipeline is scheduled to run every month; the corresponding download links are stored in a JSON file.
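
As a sketch of how a scheduled run might resolve its download link from that JSON file (file name, keys, and layout are assumptions, not the repository's actual format):

```python
import json
from pathlib import Path

# Illustrative layout of the links file (keys are period codes such as
# 2020/21_P1, values are the corresponding zip download URLs):
# {
#   "2020/21_P1": "<zip download URL for period 1>",
#   "2020/21_P2": "<zip download URL for period 2>"
# }

def url_for_period(period_code: str, links_file: str = "download_links.json") -> str:
    """Look up the download URL for the period processed by the current run."""
    links = json.loads(Path(links_file).read_text())
    return links[period_code]
```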

DAG to ingest historical delays data:

Ingest historical data DAG

DAG to ingest attributes data:

Ingest attributes data DAG

Data modeling

Data modeling is done in dbt cloud. dbt allows software engineering approaches to be applied to data modeling: it offers native tools for testing and documentation, and the data infrastructure (DWH schemas, tables, views) is created automatically. At the same time, data transformation is done using SQL, which lowers the entry threshold. All artifacts are stored in a git repository, allowing version control and distributed development.

In the current project, dbt cloud is used to stage the historical delays data (03-dbt/models/staging): the landing table records are filtered on the latest insert_datetime value and type casting is performed. An additional source of train station attributes is loaded as a seed into 03-dbt/seeds. In the core of the solution (03-dbt/models/core), station/location and date dimension tables are created. The latter was intended to be used for hierarchical filtering in the reporting tool but was largely abandoned due to limitations of Google Data Studio. Four data marts are created, which aggregate delay times by different groupings.

Full data lineage of the dbt project:

dbt lineage

Reporting

The report is built on top of the data marts created in dbt and is shared in Google Data Studio. It includes a date period filter and basic diagrams to analyze delays:

  • Total number of delays and total delay time in minutes;
  • Number of delays and total delay time per day;
  • Distribution of delay count over the regions;
  • Number of events - delays and cancellations;
  • Geography of delays.

report