This repository contains the final project for the course ID2223 Scalable Machine Learning and Deep Learning at KTH. The project proposal can be found here.
"DeLight" (Delayed fLights), our final project, is a serverless machine learning pipeline that predicts the delay of flights departing daily from Stockholm Arlanda International Airport, based on weather conditions and historical flight delay information. In this repository you can find:
- A serverless machine learning pipeline, including daily scripts for both local and remote environments.
- Flight information and Meteorological Analysis (MESAN) data for Arlanda covering the full year 2023.
- A graphical user interface built with Gradio and hosted on HuggingFace 🤗, where you can play with the data (our own version of tl;dr).
The project has been developed in Python on our own machines; Jupyter notebooks were used for the Exploratory Data Analysis, for more insightful data visualization.
The project can be divided into four main pipelines:
- Historical data collection and preparation (historical feature pipeline): the weather and flight features have been extracted from different sources, such as API vendors and institutional OpenData archives. The data have then been processed, studied and standardized to be used for training. The data are saved remotely in the Hopsworks Feature Store.
- Daily data collection (real-time feature pipeline): every day new data are extracted from various sources and processed according to the standards and rules set by the historical data, to extend the dataset and to be used for daily model training.
- Model training and evaluation (training pipeline): a new model is trained every day on a larger amount of data, thanks to the new data gathered in the second pipeline and to daily scripts running remotely on Modal. The model is then saved in the Hopsworks Model Registry.
- Batch prediction and result visualisation (inference pipeline): every day at midnight the daily forecast and the flight schedules for the following two days are collected by a remote script. The model is loaded from remote storage and the delay predictions are made. The prediction results can be accessed through a user interface created with Gradio and hosted on HuggingFace 🤗. This application lets you try out the model and see what it can do.
This is where the historical data are collected, to build a feature base for training our machine learning algorithm. The types and sources of our data are:
- Meteorological Analysis collected through SMHI OpenData Grid Archive
- Flight Information collected through Zyla API Hub - Historical Flights Information API
Because we need to collect a large amount of heterogeneous data saved in different formats, such as `GRIB` for meteorological data and `.json` for flight information, the feature pipeline is divided into weather, flight and merged feature pipelines.
Both the weather and flight feature pipelines are divided into the following stages:
- Collector: data for every day and every hour are collected iteratively, from the SMHI online library and through a flight information API respectively (the Zyla Flight API can be replaced by a competitor's alternative). The data are saved locally before being processed.
- Extractor: data are extracted from each saved file in its own format, using the Python library `pygrib` to access the meteorological `GRIB` files. All the raw weather and flight data are saved into separate `.csv` files.
- Processor: data are transformed into a shared, uniform format so that they can be studied and used for training. For instance, datetimes are split into year, month, day and hour, depending on the need; wind direction and coordinates are rotated to a standard configuration. Finally, the timezone is set to Stockholm, with DST, in both datasets.
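The datetime handling in the Processor stage could look roughly like the following. This is a minimal sketch using only the standard library, assuming the raw timestamps are ISO-8601 strings in UTC; the function name `split_local_datetime` and the output keys are illustrative and not taken from the repository.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# IANA timezone for Stockholm; DST transitions are handled by zoneinfo.
STOCKHOLM = ZoneInfo("Europe/Stockholm")


def split_local_datetime(utc_timestamp: str) -> dict:
    """Convert a UTC timestamp to Stockholm local time and split it into parts."""
    # Assumption: the source timestamp is naive and expressed in UTC.
    utc_dt = datetime.fromisoformat(utc_timestamp).replace(tzinfo=timezone.utc)
    local_dt = utc_dt.astimezone(STOCKHOLM)
    return {
        "year": local_dt.year,
        "month": local_dt.month,
        "day": local_dt.day,
        "hour": local_dt.hour,
    }


summer = split_local_datetime("2023-07-01T12:00:00")  # hour is 14 (CEST, UTC+2)
winter = split_local_datetime("2023-01-15T12:00:00")  # hour is 13 (CET, UTC+1)
print(summer, winter)
```

Applying the same conversion to both the weather and the flight data keeps the hourly keys aligned across the DST switch.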
The two resulting datasets are then merged by the Dataset Merger script. Once the merged dataset has been created, we run an Exploratory Data Analysis to evaluate which factors influence flight delay the most. More on that in the Results.
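Conceptually, the merge can be thought of as an inner join of flights and weather on their shared time columns. The sketch below is a plain-Python illustration under that assumption; the key columns, the record fields (`flight_id`, `delay_min`, `temperature`) and the function name `merge_on_hour` are hypothetical, and the actual Dataset Merger script may work differently.

```python
# Assumed shared columns produced by the Processor stage.
KEY_FIELDS = ("year", "month", "day", "hour")


def merge_on_hour(flights: list, weather: list) -> list:
    """Inner-join flight rows with the weather row of the same local hour."""
    # Index the weather rows by their (year, month, day, hour) key.
    weather_by_hour = {tuple(row[k] for k in KEY_FIELDS): row for row in weather}
    merged = []
    for flight in flights:
        key = tuple(flight[k] for k in KEY_FIELDS)
        match = weather_by_hour.get(key)
        if match is not None:  # flights without a weather row are dropped
            merged.append({**match, **flight})
    return merged


# Hypothetical toy records, for illustration only.
flights = [
    {"year": 2023, "month": 7, "day": 1, "hour": 14, "flight_id": "AB123", "delay_min": 12},
    {"year": 2023, "month": 7, "day": 1, "hour": 15, "flight_id": "CD456", "delay_min": 0},
]
weather = [
    {"year": 2023, "month": 7, "day": 1, "hour": 14, "temperature": 21.5},
]
print(merge_on_hour(flights, weather))
```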
The selected features with the best attributes (plus some others that look promising for when we have more data) are uploaded by the last script of the historical pipeline chain, called Dataset Uploader. Through this procedure, the files are uploaded and saved in a dedicated Hopsworks Feature Group.
Every night a script runs remotely on Modal, acquiring the daily schedule and the meteorological analysis of two days earlier. The Backfill Feature Pipeline script cleans and transforms the data into the project format and saves them in the Feature Group. Thanks to that, our dataset is constantly evolving, which makes it a truly dynamic source of data.
The meteorological analyses are extracted, like the historical data, through the SMHI OpenData Grid Archive, but also through the SMHI OpenData Meteorological Analysis API. The Swedavia Flight Info v2 API, instead, is used to access information on flights that departed from Stockholm Arlanda International Airport.
Most of the feature engineering, including the model-independent transformations described above, has already been done. However, having selected the gradient boosting framework `XGBoost` as our regression model, we still need to apply some model-dependent transformations, such as binning and labelling some attributes, removing attribute-specific outliers and creating dummy variables.
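As an illustration of binning and dummy variables, the sketch below bins a wind direction in degrees into eight compass sectors and one-hot encodes the result. The sector boundaries, the `wind_` column prefix and both function names are assumptions made for this example; the bins actually used in the project are not stated here.

```python
# Eight 45-degree compass sectors, listed clockwise starting from north.
SECTORS = ("N", "NE", "E", "SE", "S", "SW", "W", "NW")


def wind_sector(direction_deg: float) -> str:
    """Bin a wind direction in degrees into one of eight compass sectors."""
    # Each sector is centred on its bearing, so shift by half a sector (22.5)
    # before dividing; the final modulo wraps 337.5..360 back to north.
    index = int(((direction_deg % 360) + 22.5) // 45) % 8
    return SECTORS[index]


def one_hot(value: str, categories: tuple, prefix: str) -> dict:
    """Create one 0/1 dummy variable per category."""
    return {f"{prefix}_{category}": int(value == category) for category in categories}


sector = wind_sector(135.0)  # "SE"
print(one_hot(sector, SECTORS, "wind"))
```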
With the data standard set up, we can start the model evaluation and selection process, where we tune the hyperparameters of our model with `GridSearchCV` from `scikit-learn`:
Name | Best Value |
---|---|
n_estimators | 45 |
max_depth | 15 |
eta | 0.05 |
subsample | 0.85 |
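For readers unfamiliar with grid search, the idea is to score every combination of candidate values and keep the best one. The sketch below shows only that enumeration in plain Python; it is not scikit-learn's implementation and omits cross-validation. Apart from the best values reported in the table, the candidate values in `param_grid` are hypothetical, and `score` stands in for a cross-validated metric.

```python
from itertools import product

# Hypothetical search grid; only the best values from the table are from the project.
param_grid = {
    "n_estimators": [30, 45, 60],
    "max_depth": [10, 15],
    "eta": [0.05, 0.1],  # eta is XGBoost's name for the learning rate
    "subsample": [0.85, 1.0],
}


def iter_candidates(grid: dict):
    """Yield every combination of parameter values in the grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid: dict, score):
    """Return the candidate with the highest score (higher is better)."""
    return max(iter_candidates(grid), key=score)


candidates = list(iter_candidates(param_grid))
print(len(candidates))  # 3 * 2 * 2 * 2 = 24 combinations
```

The number of combinations grows multiplicatively with every added parameter, which is why the grid is usually kept small.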
A first version of the model is then trained on the data from the Hopsworks Feature Group and uploaded to the Hopsworks Model Registry by the Initializer. Every night a script runs remotely on Modal, fetching both the old data and the newly acquired data of the day from the feature store. The Daily Training Pipeline script trains a new model and saves it in the model registry, replacing the previous version.
Every night at midnight a script runs remotely on Modal and predicts the departure delay of flights leaving Stockholm Arlanda International Airport on that day and the following one. The flight information is collected through the Swedavia Flight Info v2 API, while the meteorological forecasts are accessed through the last API of this long list, the SMHI Open Data Meteorological Forecast API.
Once the data are collected, the latest trained version of the regression model is downloaded from our Hopsworks Model Registry and the new predictions are computed. These are saved in the Hopsworks File System, replacing the previous day's predictions.
The results of our work can also be 👀 seen outside this repo, by 🛬 landing on a fun 🛩️ User Interface
🛩️ Select a specific flight by answering some absolutely relevant questions.
All the files needed to recreate the same graphical interface or test your own with a synthetic dataset can be found in the User Interface folder.
From the first EDA onwards, and through further analysis, the data showed a low correlation (<20% for every variable) between flight schedules, weather conditions and flight delay. Starting from that baseline, however, significant improvements were obtained through feature selection, binning variables and creating dummy variables, especially for the most influential variables (e.g. wind from the south-east, temperature below -20°C, local flights). Analyzing the data, we discovered that the phenomena just described are more an exception than a recurrent event, so we can assume that those few occurrences are not numerous enough to influence the model significantly.
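A correlation check of this kind can be reproduced with a few lines of Python. The sketch below assumes that the figure refers to the absolute Pearson correlation coefficient between a feature and the delay (so "<20%" would mean |r| < 0.2); the function name `pearson` and the toy values are illustrative and are not project data.

```python
from math import sqrt


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy values only: a feature that carries no linear signal about the delay.
feature = [1, 2, 3, 4]
delay = [1, -1, -1, 1]
r = pearson(feature, delay)
print(abs(r) < 0.2)  # True
```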
Another significant factor is the arguably small amount of data collected, since the dataset consists only of 2023 flights departing from Stockholm Arlanda International Airport.
Future improvements could come from waiting for more data, more intensive model tuning, replacing the model with a different one, or adding new features collected from other sources with more varied data (e.g. the number of arrivals around the departure time).
The whole project has been developed with close attention to replicability and future-proofing. Indeed, it is possible to replicate the whole project by running the scripts locally on your own laptop, using only the Python environment defined below, without any other platform such as Modal or Hopsworks.
To get ready, you first need to set up your own environment according to the file `environment.yaml`. This can be done easily via `conda`, with the command `conda env create --file environment.yaml`, if you have any distribution of Conda (Anaconda, Miniconda, etc.) installed on your machine. This will set up the right version of Python, with all the dependencies needed for this project, including `pip` dependencies.
Then, you will need access to a flight information provider and a weather forecast/analysis provider, so you will need to get an API key from those services. The Swedavia APIs and SMHI OpenData APIs are our suggested free options, for flights from and to Sweden and weather information over a broader area covering the Scandinavian countries. As for the historical data, you can easily access all the stages of data cleaning and processing in Datasets.
Roll up your sleeves. Now you are ready for real!
Visual Studio Code - main IDE
GitKraken - git versioning
Google Colab - running environment
HuggingFace - dataset, model registry, GUI
Gradio - GUI
Modal - run daily remote script
Hopsworks - MLOps platform
Zyla API - historical flight API
Swedavia API - real-time flight API
SMHI API - real-time forecast API
SMHI OpenData - historical meteorological API