/CISC849Final

This repo is used for final project of CISC849

Primary LanguageJupyter NotebookGNU General Public License v3.0GPL-3.0

Predicting Flight Cancellations During the COVID-19 Pandemic

Take away points

This repo is used for final project of CISC849, Fall 2020, a graduate level seminar offered at the University of Delaware. Our group consists of Matt Leinhauser and Eric (Yifan) Zhang.

We hosted a competition between automated machine learning pipeline generating tools and manually selected model. We have explored the usage of TPOT on both CPU and GPU, we also explicitly selected models from several basic machine learning algorithms. Besides the competition, we discussed data balancing technique to achieve better performance. This project could give an example of the comparison between auto ML tools and human selection on a same data science task.

Context of this project - TPOT

This course requires each group reading a paper (or article) related to data science and presenting it formally to the entire class. We chose an article about TPOT [1], which is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

We uploaded the slides we used for the presentation, check TPOT Paper presentation.pptx in this repo. In the slides, we describe and visualize the main technique involved in TPOT.

Project idea

We were also required to conduct a final project for this course. The final project needs to relate interesting tasks to data science. First, we had to present a proposal about the final project. In this proposal, we identified the task, dataset, tentative technique/ML algorithms, and timeline.

We checked Kaggle and found a dataset called COVID-19 Airline Flight Delays and Cancellations - Which airlines have been the most affected by COVID-19?[2]. It has the following description:

The United States Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. The data collected is from January - June 2020 and contains relevant flight information (on-time, delayed, canceled, diverted flights) from the Top 10 United States flight carriers for 11 million flights.

As COVID-19 is still impacting the entire world (at the time of writing, December 2020), this dataset provides some insights into how it affect the airline industry. We found this topic interesting and after discussion, we planned to use this dataset to try to predict the flight cancellations based on the information contained in this dataset.

We have presented TPOT as a useful automated tool to generate ML pipelines, one drawback of TPOT is that it requires relatively a lot of time for generating. So an idea came to mind: What about hosting a competition between TPOT and a manually-selected model? One of our group members (Eric) focused on learning how to generate optimal machine learning pipelines using TPOT and the other group member (Matt) focused on manually selecting the best models.

Goal

Our project had the following goals: 1.) Accurately predict airline flight cancellations during the COVID-19 pandemic from Jan. 2020 - Jun. 2020 2.) Become familiar with TPOT and see if TPOT can give a good machine learning pipeline that will help us achieve Goal 1. 3.) Have one group member not use TPOT and not look at the TPOT results and see if they can create a better machine learning pipeline than TPOT. a.) It is worth noting that this goal is measured in terms of accuracy on the data.

Data description

There are 47 features for each data item, the dataset has a file called ColumnDescription.txt to describe each feature. We analyzed all features by two methods: 1) Manually categorize them. 2) Compute and plot features correlations. We provide categories, feature names, descriptions, and whether or not kept in belowing table, where {1} indicates CPU-TPOT, {2} indicates GPU-TPOT.

Categories Feature Description Kept
Time YEAR Year
Time QUARTER 1: Jan-Mar, 2: Apr-Jun, 3: Jul-Sep, 4: Oct-Dec {1}
{2}
Time MONTH Month of Year {1}
{2}
Time DAY_OF_MONTH Date of Month {1}
{2}
Time DAY_OF_WEEK Day of Week (1: Monday, 7: Sunday) {1}
{2}
Time FL_DATE Full flight date (M/DD/YYYY)
Identification MKT_UNIQUE_CARRIER Airline Carrier Code:
AA: American Airlines
AS: Alaska Airlines
B6: JetBlue
DL: Delta Air Lines
F9: Frontier Airlines
G4: Allegiant Air
HA: Hawaiian Airlines
NK: Spirit Airlines
UA: United Airlines
WN: Southwest Airlines
{1}
{2}
Identification MKT_CARRIER_FL_NUM Flight Number {1}
{2}
Identification TAIL_NUM Aircraft Tail Number (Usually starts with 'N') {2}
Location ORIGIN Flight Departure 3-Letter Airport Abbreviation {1}
{2}
Location ORIGIN_CITY_NAME Flight Departure City, State Names
Location ORIGIN_STATE_ABR Flight Departure 2-Letter State Abbreviation
Location ORIGIN_STATE_NM Flight Departure State Name
Location DEST Flight Arrival 3-Letter Airport Abbreviation {1}
{2}
Location DEST_CITY_NAME Flight Arrival City, State Names
Location DEST_STATE_ABR Flight Arrival 2-Letter State Abbreviation
Location DEST_STATE_NM Flight Arrival State Name
Departure CRS_DEP_TIME Scheduled Departure Time (HHMM) (Single or 2-Digit Values Represent 00:MM, e.g. 3 represents 00:03 or 12:03 AM) {1}
{2}
Departure DEP_TIME Actual Departure Time (HHMM) {2}
Departure DEP_DELAY Departure Delay (Difference Between Actual Departure Time and Scheduled Departure Time in Minutes) {2}
Departure DEP_DELAY_NEW Departure Delay Ignoring Early Departures (Listed as 0)
Departure DEP_DEL15 Departure Delay Greater Than 15 Minutes (0: Not Greater Than 15, 1: Greater Than 15) {2}
Departure DEP_DELAY_GROUP Departure Delay in Number of 15-minute increments Rounded Down (e.g. Early Departure (< 0) is a value of -1, 30 or 42 minutes is a value of 2) {2}
Departure DEP_TIME_BLK Scheduled Departure Time in Hourly Block (HHMM) {1}
Departure TAXI_OUT Time between Airplane Taxi from Gate and Takeoff (WHEELS_OFF) Time (in Minutes)
Departure WHEELS_OFF Time of Airplane Takeoff (HHMM)
Arrival WHEELS_ON Time of Airplane Landing (HHMM)
Arrival TAXI_IN Time between Airplane Taxi to Gate and Landing (WHEELS_ON) Time (in Minutes)
Arrival CRS_ARR_TIME Scheduled Arrival Time (HHMM) (Single or 2-Digit Values Represent 00:MM, e.g. 3 represents 00:03 or 12:03 AM) {1}
{2}
Arrival ARR_TIME Actual Arrival Time (HHMM) {2}
Arrival ARR_DELAY Arrival Delay (Difference Between Actual Arrival Time and Scheduled Arrival Time in Minutes) {2}
Arrival ARR_DELAY_NEW Arrival Delay Ignoring Early Arrivals (Listed as 0)
Arrival ARR_DEL15 Arrival Delay Greater Than 15 Minutes (0: Not Greater Than 15, 1: Greater Than 15) {2}
Arrival ARR_DELAY_GROUP Arrival Delay in Number of 15-minute increments Rounded Down (e.g. Early Arrival (< 0) is a value of -1, 30 or 42 minutes is a value of 2) {2}
Arrival ARR_TIME_BLK Scheduled Arrival Time in Hourly Block (HHMM) {1}
Cancellation CANCELLED 0: Flight Not Cancelled, 1: Flight Cancelled {1}
{2}
Cancellation CANCELLATION_CODE Reason for Cancellation - if Cancelled, Letter Present (A: Carrier, B: Weather, C: National Aviation System, D: Security)
On flight CRS_ELAPSED_TIME Scheduled Total Flight Time (in Minutes) {1}
{2}
On flight ACTUAL_ELAPSED_TIME Actual Total Elapsed Flight Time (in Minutes) {2}
On flight AIR_TIME Actual Total Elapsed Time Airplane in the Air (in Minutes)
On flight DISTANCE Distance Between Departure and Arrival Airports (in Miles) {1}
{2}
On flight DISTANCE_GROUP Distance Between Departure and Arrival Airports in Number of 250-Mile increments Rounded Down (e.g. 400 miles is a value of 1) {1}
{2}
Delay CARRIER_DELAY Carrier Delay (in Minutes)
Delay WEATHER_DELAY Weather Delay (in Minutes)
Delay NAS_DELAY National Aviation System Delay (in Minutes)
Delay SECURITY_DELAY Security Delay (in Minutes)
Delay LATE_AIRCRAFT_DELAY Late Aircraft Delay (in Minutes)

TPOT Implementation

CPU-based

We firstly implemented what we have read in the TPOT paper - the CPU-based TPOT [1], the notebook could be found in this repo as CISC849_TPOT_CPU.ipynb. Our machine is Intel(R) Core(TM) i5-9600k CPU @ 3.70GHz

For feature selection, we kept several features listed in the table above. We encoded all categorical features into numerical values using label encoding. Training-testing data was splited by 75%/25%. We implemented TPOT with the following parameter tpot = TPOTClassifier(generations=5, population_size=40, verbosity=1, random_state=42, n_jobs = -1, warm_start = True, max_time_mins = 60).

After 6 hours running, the TPOT has selected a pipeline of KNeighborsClassifier(n_neighbors=12, p=1, weights='distance')) and achieved 91.89% accuracy. The exported python script can be viewed in tpot_airflight_pipeline_CPU.py.

GPU-based TPOT

We also found a article called Faster AutoML with TPOT and RAPIDS[4]. In this article, the author described that TPOT could be acceralted by GPU and achieve better performance with less time. We then have implemented GPU-acceralted TOPT on Google Colab. The GPU in Colab was Tesla T4.

We have also kept different features with previous CPU-TPOT, check above table for details. After one hour running, we have achieved 92.79% accuracy.

One hot encoding and data balancing

According to Buda (2018) comprehensive review on data imbalancing solutions [5], daba imbalancing could cause prediction inaccurate. In this data set, the cancelled flights only occupy ~10%, causing data imbalancing. Thus, we considered undersampling for data balancing. Additionally, we have tried one hot encoding for feature engineering. The code could be viewed in CISC849_TPOT_GPU.ipynb.

Manually Selecting a Machine Learning Pipeline

Within the dataset, the CANCELLED feature is binary (0 represents a flight that was not cancelled and 1 represents a flight that was cancelled). Due to this fact, I (Matt) decided to use machine learning algorithms that are good for supervised learning binary classification problems.

Packages Used:

For the core ML algorithms, we used scikit-learn. For plotting figures, we used seaborn and matplotlib. For dataset manipulations, we used pandas to transform the dataset into a dataframe.

Algorithms Selected:

  • Naive Bayes Classifier -- Within Naive Bayes, there is a threshold (50%) on how likely an instance is to be classified as a value (0 or 1) in a binary classification problem. We chose to use Naive Bayes because we believe it might mirror how flights are cancelled in reality. Looking at all factors (weather, arrival times, departure times, etc.), if an airline team determines there is a greater than 50% chance of cancelling the flight, it probably will get cancelled. We hoped Naive Bayes could emulate this decision making process.

  • Decision Tree Classifier -- For the Decision Tree Classifier we used GINI Impurity. GINI impurity measures the degreee of probability of a particular variable being wrongly classified when it is randomly chosen [3]. We chose to use a decision tree because it seems like a very logical way on how to cancel a flight. For example, if it is snowing out, the tree would follow a certain path that could not be followed if it was not snowing. Every path will lead to a decision, whether to cancel the flight or not.

  • K Nearest Neighbors Classifier -- We decided to use the KNN classifier because it also seems intuitive. If three flights, with very similar attributes, are all classified as cancelled and a fourth flight comes along with similar attributes to those three flights, there is a high probability that flight will also be classified as cancelled. We also think this might mirror how flight cancellations are made in real life. For example, if Southwest Airlines, Hawaiian Airlines, and JetBlue cancel their flights from Denver, CO to Honolulu, HI because of snow, American Airlines will most likely also do the same.

  • Random Forest Classifier -- Similar to the Decision Tree Classifier, we used the Random Forest Classifier as a "more powerful" decision tree. A Random Forest Classifier is basically a tree of decision trees. The tree that has the highest accuracy is then chosen as the tree the Random Forest Classifier uses.

Methodology

To start, we wanted to use the default classifiers scikit-learn offered. We figured based on the results from the default configurations, we could then tune the hyperparameters to increase our accuracy as needed. To divide up the dataset, we did so two ways. First, we did a regular train-test split, thus dividing our data into a training set and a testing set. We did this using scikit-learn's train_test_split function. Second, we used k-fold cross validation with k=5. We decided to test both ways because the dataset is extremely unbalanced. As a whole, the dataset only contains 282,926 cancelled flights, or just over 10% of the data given to us! By using k-fold cross validation, we can verify if the accuracy we achieve from the train-test split is, well, accurate.

Results using training set and testing set

Using the train-test split, we achieved the following results with only the default classifiers:

ML Method Accuracy
Naive Bayes Classifier 89.35%
Decision Tree Classifier 91.61%
K-Nearest Neighbor Classifier 90.94%
Random Forest Classifier 93.30%

Results using k-fold Cross Validation

Using 5-fold cross validation, we achieved the following results with only the default classifiers:

ML Method k=5 Accuracy k=10 Accuracy
Naive Bayes Classifier 73.63% 86.69%
Decision Tree Classifier 68.63% 67.44%
K-Nearest Neighbor Classifier 73.63% 86.69%
Random Forest Classifier 76.25% 75.40%

Conclusions

Our project had three goals: 1.) Accurately predict airline flight cancellations during the COVID-19 pandemic from Jan. 2020 - Jun. 2020 2.) Become familiar with TPOT and see if TPOT can give a good machine learning pipeline that will help us achieve Goal 1. 3.) Have one group member not use TPOT and not look at the TPOT results and see if they can create a better machine learning pipeline than TPOT. a.) It is worth noting that this goal is measured in terms of accuracy on the data.

We were able to achieve all three of these goals in the following ways. For Goal #1, we demonstrated that using TPOT and manually creating a machine learning pipeline, we were able to create a model that can predict airline flight cancellations during the COVID-19 pandemic. While the accuracy of our results varies, we have demonstrated that our models perform much better than a random guess. Second, we achieved goal number two by really exploring TPOT. Eric was able to get TPOT running on the CPU within Google Colab and his local machine. In addition to just using TPOT on the CPU, he also figured out how to run TPOT on a GPU using the cuML library from RAPIDS [6]. Using TPOT on the GPU sped up the time taken to create an accurate predictive model and it also predicted a different pipeline to use than the CPU run of TPOT. Finally, we were able to demonstrate that Matt created a machine learning pipeline that scored higher accuracy than the pipeline TPOT generated (which was goal #3). Coming together at the after each of us completed our parts offered us great insights into how to solve the problem at hand in a different way.

From this project, we learned how to use TPOT on both the CPU and GPU and learned how it creates effective machine learning pipelines. We also learned how to perform effective feature engineering on a dataset. If we had more time with this project, we would have liked to continue to find the most optimal features. Similarly, we learned to filter out which algorithms would be useful for this problem, and which would not be useful, by truly understanding the data in the dataset. For the future, we are interested in exploring if our model can generalize to future pandemics (and not just the COVID-19 pandemic) and seeing if we can generate more accurate results by further exploring the use of a one-hot-encoder on the dataset, performing additional feature engineering and pre-processing, and creating a method to make the dataset more balanced in terms of cancelled flights and flights that were not cancelled.

[1] Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492.

[2] https://www.kaggle.com/akulbahl/covid19-airline-flight-delays-and-cancellations

[3] https://blog.quantinsti.com/gini-index/

[4] https://medium.com/rapids-ai/faster-automl-with-tpot-and-rapids-758455cd89e5

[5] Buda, M., Maki, A., & Mazurowski, M. A. (2018). A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106, 249-259.

[6] https://github.com/rapidsai/cuml