/sml-project-2023-manfredi-meneghin

Scalable Machine Learning and Deep Learning, Final Project, 2023/2024


“DeLight” Delayed fLights: A Dynamic Serverless Machine Learning pipeline

About

This repository contains the Final Project for the course ID2223 Scalable Machine Learning and Deep Learning at KTH. The project proposal can be found here.

"DeLight" (Delayed fLights), our final project, consists of a Serverless Machine Learning pipeline that predicts the delay of flights departing daily from Stockholm Arlanda International Airport, based on weather conditions and historical flight delay information. In this repository you can find:

The project was developed in Python on our own machines; Jupyter notebooks were used for the Exploratory Data Analysis, to obtain more insightful data visualizations.

The Team


Introduction

The project can be divided into four main pipelines:

  • Historical data collection and preparation (historical feature pipeline): the features on weather conditions and flight information have been extracted from different sources, such as API vendors and institutional OpenData archives. The data have then been processed, studied and made uniform so they can be used for training. The data are saved remotely in the Hopsworks Feature Store.
  • Daily data collection (real-time feature pipeline): every day new data are extracted from various sources and processed, according to the standards and rules set for the historical data, to be used for data augmentation and daily model training.
  • Model training and evaluation (training pipeline): a new model is trained every day on a larger amount of data, thanks to the new data gathered by the second pipeline and to daily scripts running remotely on Modal. The model is then saved in the Hopsworks Model Registry.
  • Batch prediction and result visualisation (inference pipeline): every day at midnight the forecasts and flight schedules for the following two days are collected by a remote script. The model is loaded from remote storage and delay predictions are made. The prediction results can be accessed through a user interface created with Gradio and hosted on HuggingFace 🤗. This application lets you use and explore the model's functionality.

Architecture

Machine Learning Pipeline Schema

Pipelines description

Historical feature pipeline

This is where the historical data are collected, to build the feature base used to train our machine learning algorithm. The types and sources of our data are:

Because a large amount of heterogeneous data had to be collected, saved in different formats such as GRIB for meteorological data and .json for flight information, the feature pipeline is divided into weather, flight and merged feature pipelines.

Both the weather and the flight feature pipelines are divided into the following stages:

  • Collector: data for every day and every hour are collected iteratively from the SMHI online library and through a flight information API, respectively (Zyla Flight API can be replaced by a competitor's alternative). The data are saved locally before being processed.
  • Extractor: data are extracted from each saved file according to its format, using the Python library pygrib to access the meteorological GRIB files. All raw weather and flight data are saved into separate .csv files.
  • Processor: data are transformed into a shared, uniform format so they can be studied and used for training. For instance, datetimes are split into year, month, day and hour, depending on the need; wind direction and coordinates are rotated to a standard configuration. Finally, the timezone is set to Stockholm, with DST, in both datasets.
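
As an illustration of the Processor step, the following is a minimal sketch of the datetime handling, assuming the raw timestamps are ISO 8601 strings in UTC. The function name split_datetime and the input convention are assumptions made for this example and are not taken from the project scripts.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Stockholm local time, including daylight saving time (DST) rules.
STOCKHOLM = ZoneInfo("Europe/Stockholm")


def split_datetime(utc_timestamp):
    """Convert an ISO 8601 UTC timestamp to Stockholm time and split it.

    Hypothetical helper: the real Processor scripts may parse other formats.
    """
    # Mark the parsed timestamp explicitly as UTC (assumed input convention).
    utc_dt = datetime.fromisoformat(utc_timestamp).replace(tzinfo=timezone.utc)
    # astimezone applies the DST rules of the target zone automatically.
    local_dt = utc_dt.astimezone(STOCKHOLM)
    return {
        "year": local_dt.year,
        "month": local_dt.month,
        "day": local_dt.day,
        "hour": local_dt.hour,
    }


winter = split_datetime("2023-01-15T12:00:00")  # CET, UTC+1
summer = split_datetime("2023-07-01T12:00:00")  # CEST, UTC+2
print(winter)
print(summer)
```

Splitting after the timezone conversion matters: the same UTC hour maps to a different local hour in winter and in summer, so the weather and flight rows only line up if both datasets are converted the same way.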

The two resulting datasets are then merged by the Dataset Merger script. Once the merged dataset has been created, we carry out an Exploratory Data Analysis to evaluate which factors most influence flight delay. More on that in the Results.

The selected features with the best attributes (plus some others that look promising for a future in which we will have more data) are uploaded by the last script of the historical pipeline chain, the Dataset Uploader. Through this procedure, the files are uploaded and saved in a dedicated Hopsworks Feature Group.
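
The merge can be pictured as a join on the shared date and hour columns. The sketch below is dependency-free and only illustrative: the column names (flight_id, dep_delay, temperature, wind_speed), the sample rows and the choice of an inner join are assumptions, not the actual Dataset Merger logic. With pandas, the same idea would typically be expressed with DataFrame.merge on the key columns.

```python
KEY_FIELDS = ("year", "month", "day", "hour")


def merge_on_hour(flights, weather):
    """Inner-join flight rows with the weather row of the same local hour."""
    # Index weather rows by their hourly key for constant-time lookup.
    weather_by_hour = {
        tuple(row[k] for k in KEY_FIELDS): row for row in weather
    }
    merged = []
    for flight in flights:
        key = tuple(flight[k] for k in KEY_FIELDS)
        weather_row = weather_by_hour.get(key)
        if weather_row is None:
            continue  # no weather record for that hour: the flight is dropped
        merged.append({**weather_row, **flight})
    return merged


# Made-up sample rows, for illustration only.
flights = [
    {"year": 2023, "month": 1, "day": 15, "hour": 13, "flight_id": "XX123", "dep_delay": 12},
    {"year": 2023, "month": 1, "day": 15, "hour": 22, "flight_id": "XX456", "dep_delay": 0},
]
weather = [
    {"year": 2023, "month": 1, "day": 15, "hour": 13, "temperature": -3.5, "wind_speed": 6.0},
]

merged = merge_on_hour(flights, weather)
print(merged)
```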

Realtime feature pipeline

Every night a script runs remotely on Modal, acquiring the daily schedule and the meteorological analysis of two days before. The Backfill Feature Pipeline script cleans and transforms the data into the project format and saves them in the Feature Group. As a result, our dataset is constantly evolving, which makes it a truly dynamic source of data.

The meteorological analyses are extracted, like the historical data, through the SMHI OpenData Grid Archive, but also through the SMHI OpenData Meteorological Analysis API. The Swedavia Flight Info v2 API, instead, is used to access information on flights that have departed from Stockholm Arlanda International Airport.

Training pipeline

Most of the feature engineering, including the model-independent transformations described above, has already been done at this point. However, having selected XGBoost, a gradient boosting framework, as our regression model, we still need some model-dependent transformations, such as binning and labelling some attributes, removing attribute-specific outliers and creating dummy variables.

Once the data format is set, we can start the Model Evaluation and Selection process, in which we tune the hyperparameters of our model with scikit-learn's GridSearchCV:
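
To make these transformations concrete, here is a small dependency-free sketch of binning, dummy-variable creation and outlier removal. Everything in it is an assumption for illustration: the column names (temperature, wind_dir_deg, dep_delay), the bin edges, the eight compass sectors and the 180-minute outlier threshold are not taken from the project code.

```python
TEMP_BINS = ["extreme_cold", "freezing", "mild", "warm"]
WIND_SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def bin_temperature(celsius):
    """Map a temperature to a coarse label (bin edges are illustrative)."""
    if celsius < -20:
        return "extreme_cold"
    if celsius < 0:
        return "freezing"
    if celsius < 15:
        return "mild"
    return "warm"


def wind_sector(degrees):
    """Label the 45-degree compass sector the wind blows from."""
    return WIND_SECTORS[int(((degrees % 360) + 22.5) // 45) % 8]


def one_hot(value, categories, prefix):
    """Create one 0/1 dummy column per category."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


def prepare_row(row, max_delay_min=180.0):
    """Return model-ready features, or None if the row is a delay outlier."""
    if abs(row["dep_delay"]) > max_delay_min:
        return None  # hypothetical outlier rule
    features = {}
    features.update(one_hot(bin_temperature(row["temperature"]), TEMP_BINS, "temp"))
    features.update(one_hot(wind_sector(row["wind_dir_deg"]), WIND_SECTORS, "wind"))
    return features


kept = prepare_row({"temperature": -25.0, "wind_dir_deg": 135.0, "dep_delay": 12.0})
dropped = prepare_row({"temperature": 5.0, "wind_dir_deg": 10.0, "dep_delay": 600.0})
print(kept)
print(dropped)
```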

Name | Best Value
n_estimators | 45
max_depth | 15
eta | 0.05
subsample | 0.85
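
For readers unfamiliar with grid search, the sketch below shows the exhaustive enumeration that GridSearchCV performs, without any dependency. It is not the project's tuning code: the candidate values in param_grid are assumptions (only the best values listed above come from the project), and toy_score is a made-up function, unrelated to real model error, rigged to have its minimum at those values so that the search has something to find. GridSearchCV would instead score each combination by cross-validated model performance.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and keep the lowest score.

    score_fn is a hypothetical callable that returns an error to minimise.
    """
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Candidate values are illustrative assumptions.
param_grid = {
    "n_estimators": [30, 45, 60],
    "max_depth": [10, 15, 20],
    "eta": [0.05, 0.1],
    "subsample": [0.7, 0.85, 1.0],
}

# Best values as listed in the table above.
REPORTED_BEST = {"n_estimators": 45, "max_depth": 15, "eta": 0.05, "subsample": 0.85}


def toy_score(params):
    """Made-up error surface: relative distance from REPORTED_BEST."""
    return sum(
        abs(params[name] - target) / target
        for name, target in REPORTED_BEST.items()
    )


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the list lengths (3 x 3 x 2 x 3 = 54 combinations here), which is why the grid is usually kept small.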

Then a first version of the model is trained on the data read from the Hopsworks Feature Group and uploaded to the Hopsworks Model Registry by the Initializer. Every night a script runs remotely on Modal, reading from the feature store both the old data and the data newly acquired that day. The Daily Training Pipeline script trains a new model and saves it in the model registry, replacing the previous version.

Inference pipeline

Every night at midnight a script runs remotely on Modal and predicts the departure delay of flights departing from Stockholm Arlanda International Airport, for the day itself and the following one. The flight information is collected through the Swedavia Flight Info v2 API, while the meteorological forecasts are accessed through the last API of this long list, the SMHI Open Data Meteorological Forecast API.

Once the data have been collected, the latest trained version of the regression model is downloaded from our Hopsworks Model Registry and the new predictions are calculated. They are saved to the Hopsworks File System, replacing the previous day's predictions.

Results

Graphical User Interface

The results of our work can also be 👀 seen outside this repo, by 🛬 landing on a fun 🛩️ User Interface ✈️ hosted on HuggingFace 🤗. With three different tabs, you can:

🛩️ Select a specific flight, by answering some absolutely relevant questions.
✈️ View the full schedule of today's or tomorrow's flights, with their respective delays.
📊 Take a glance at model performance and dataset size, with daily updates.

All the files needed to recreate the same graphical interface or test your own with a synthetic dataset can be found in the User Interface folder.

Model Performance

Since the first EDA, and then through further analysis, the data have shown a low correlation (<20% for any variable) between flight schedules, weather conditions and flight delay. From that starting point, however, significant improvements have been obtained through feature selection, binning variables and creating dummy variables, especially for the most influential variables (e.g. wind from the south-east, temperature below -20°C, local flights). Analyzing the data, we discovered that the phenomena just described are more an exception than a recurrent event, so we can assume that those few occurrences are not numerous enough to influence the model significantly.

Another significant factor is the arguably small amount of data collected, since the dataset covers only 2023 flights departing from Stockholm Arlanda International Airport.

Future improvements could be pursued by collecting a larger amount of data over time, by more intensive model tuning, by replacing the model with a different one, or by adding new features from other sources with more varied data (e.g. number of arrivals around departure time).
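
As a reference for how such a correlation figure can be obtained, the sketch below computes the Pearson coefficient between one feature and the delay. It assumes that the "<20%" above refers to the absolute value of this coefficient, which the text does not state explicitly; the numbers are made up and only illustrate the two extremes.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values: a feature that determines the delay linearly...
linear = pearson([2.0, 4.0, 6.0, 8.0], [5.0, 9.0, 13.0, 17.0])
# ...and one with no linear relationship to it.
unrelated = pearson([1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0])
print(linear, unrelated)
```

A coefficient close to zero only rules out a linear relationship, which is one reason why binned and dummy variables can still help a tree-based model.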

How to run

The whole project has been developed with great attention to replicability and future-proofing. It is indeed possible to replicate the whole project by running the scripts locally on your own laptop, using only the Python environment defined below, without any other platform such as Modal or Hopsworks.

To get ready, you first need to set up your environment according to the file environment.yaml. This can easily be done with conda, through the command conda env create --file environment.yaml, provided that a Conda distribution (Anaconda, Miniconda, etc.) is installed on your machine. This sets up the right version of Python with all the dependencies needed for this project, including the pip dependencies.

Then you will need access to a flight information provider and to a weather forecast/analysis provider, so you will need to get the API key for those services. Swedavia APIs and SMHI OpenData APIs are our suggested free options, for flights from and to Sweden and for weather information over a broader area covering the Scandinavian countries. As for the historical data, you can easily access all the stages of data cleaning and processing in Datasets.

Roll up your sleeves. Now you are ready for real!

Software used

Visual Studio Code - main IDE

GitKraken - git versioning

Google Colab - running environment

HuggingFace - dataset, model registry, GUI

Gradio - GUI

Modal - run daily remote script

Hopsworks - MLOps platform

Zyla API - historical flight API

Swedavia API - real-time flight API

SMHI API - real-time forecast API

SMHI OpenData - historical meteorological API