This repo contains the final project for CISC849, Fall 2020, a graduate-level seminar offered at the University of Delaware. Our group consists of Matt Leinhauser and Eric (Yifan) Zhang.
We hosted a competition between an automated machine learning pipeline-generation tool and manually selected models. We explored TPOT on both CPU and GPU, and we also explicitly selected models from several basic machine learning algorithms. Besides the competition, we discuss a data balancing technique to achieve better performance. This project gives an example of comparing AutoML tools against human model selection on the same data science task.
This course requires each group to read a paper (or article) related to data science and present it formally to the entire class. We chose an article about TPOT [1], a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming.
We uploaded the slides we used for the presentation; see `TPOT Paper presentation.pptx` in this repo. In the slides, we describe and visualize the main techniques involved in TPOT.
We were also required to conduct a final project for this course that applies data science to an interesting task. First, we had to present a proposal about the final project, in which we identified the task, dataset, tentative techniques/ML algorithms, and timeline.
We checked Kaggle and found a dataset called COVID-19 Airline Flight Delays and Cancellations - Which airlines have been the most affected by COVID-19?
[2]. It has the following description:
The United States Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. The data collected is from January - June 2020 and contains relevant flight information (on-time, delayed, canceled, diverted flights) from the Top 10 United States flight carriers for 11 million flights.
As COVID-19 is still impacting the entire world (at the time of writing, December 2020), this dataset provides some insight into how it affects the airline industry. We found this topic interesting and, after discussion, we planned to use this dataset to try to predict flight cancellations from the information it contains.
We had presented TPOT as a useful automated tool for generating ML pipelines, but one drawback of TPOT is that pipeline generation takes a relatively long time. So an idea came to mind: what about hosting a competition between TPOT and a manually selected model?
One of our group members (Eric) focused on learning how to generate optimal machine learning pipelines using TPOT and the other group member (Matt) focused on manually selecting the best models.
Our project had the following goals:
1. Accurately predict airline flight cancellations during the COVID-19 pandemic (Jan. 2020 - Jun. 2020).
2. Become familiar with TPOT and see if TPOT can give a good machine learning pipeline that will help us achieve Goal 1.
3. Have one group member not use TPOT, not look at the TPOT results, and see if they can create a better machine learning pipeline than TPOT.
   - It is worth noting that this goal is measured in terms of accuracy on the data.
There are 47 features for each data item, and the dataset includes a file called `ColumnDescription.txt` describing each feature. We analyzed all features in two ways: 1) manually categorizing them, and 2) computing and plotting feature correlations (a minimal correlation-plot sketch appears after the table). The table below lists the category, feature name, and description of each feature, and whether it was kept, where {1} indicates the feature was kept for CPU-TPOT and {2} indicates it was kept for GPU-TPOT.
Categories | Feature | Description | Kept |
---|---|---|---|
Time | YEAR | Year | |
Time | QUARTER | 1: Jan-Mar, 2: Apr-Jun, 3: Jul-Sep, 4: Oct-Dec | {1} {2} |
Time | MONTH | Month of Year | {1} {2} |
Time | DAY_OF_MONTH | Date of Month | {1} {2} |
Time | DAY_OF_WEEK | Day of Week (1: Monday, 7: Sunday) | {1} {2} |
Time | FL_DATE | Full flight date (M/DD/YYYY) | |
Identification | MKT_UNIQUE_CARRIER | Airline Carrier Code (AA: American Airlines, AS: Alaska Airlines, B6: JetBlue, DL: Delta Air Lines, F9: Frontier Airlines, G4: Allegiant Air, HA: Hawaiian Airlines, NK: Spirit Airlines, UA: United Airlines, WN: Southwest Airlines) | {1} {2} |
Identification | MKT_CARRIER_FL_NUM | Flight Number | {1} {2} |
Identification | TAIL_NUM | Aircraft Tail Number (Usually starts with 'N') | {2} |
Location | ORIGIN | Flight Departure 3-Letter Airport Abbreviation | {1} {2} |
Location | ORIGIN_CITY_NAME | Flight Departure City, State Names | |
Location | ORIGIN_STATE_ABR | Flight Departure 2-Letter State Abbreviation | |
Location | ORIGIN_STATE_NM | Flight Departure State Name | |
Location | DEST | Flight Arrival 3-Letter Airport Abbreviation | {1} {2} |
Location | DEST_CITY_NAME | Flight Arrival City, State Names | |
Location | DEST_STATE_ABR | Flight Arrival 2-Letter State Abbreviation | |
Location | DEST_STATE_NM | Flight Arrival State Name | |
Departure | CRS_DEP_TIME | Scheduled Departure Time (HHMM) (Single or 2-Digit Values Represent 00:MM, e.g. 3 represents 00:03 or 12:03 AM) | {1} {2} |
Departure | DEP_TIME | Actual Departure Time (HHMM) | {2} |
Departure | DEP_DELAY | Departure Delay (Difference Between Actual Departure Time and Scheduled Departure Time in Minutes) | {2} |
Departure | DEP_DELAY_NEW | Departure Delay Ignoring Early Departures (Listed as 0) | |
Departure | DEP_DEL15 | Departure Delay Greater Than 15 Minutes (0: Not Greater Than 15, 1: Greater Than 15) | {2} |
Departure | DEP_DELAY_GROUP | Departure Delay in Number of 15-minute increments Rounded Down (e.g. Early Departure (< 0) is a value of -1, 30 or 42 minutes is a value of 2) | {2} |
Departure | DEP_TIME_BLK | Scheduled Departure Time in Hourly Block (HHMM) | {1} |
Departure | TAXI_OUT | Time between Airplane Taxi from Gate and Takeoff (WHEELS_OFF) Time (in Minutes) | |
Departure | WHEELS_OFF | Time of Airplane Takeoff (HHMM) | |
Arrival | WHEELS_ON | Time of Airplane Landing (HHMM) | |
Arrival | TAXI_IN | Time between Airplane Taxi to Gate and Landing (WHEELS_ON) Time (in Minutes) | |
Arrival | CRS_ARR_TIME | Scheduled Arrival Time (HHMM) (Single or 2-Digit Values Represent 00:MM, e.g. 3 represents 00:03 or 12:03 AM) | {1} {2} |
Arrival | ARR_TIME | Actual Arrival Time (HHMM) | {2} |
Arrival | ARR_DELAY | Arrival Delay (Difference Between Actual Arrival Time and Scheduled Arrival Time in Minutes) | {2} |
Arrival | ARR_DELAY_NEW | Arrival Delay Ignoring Early Arrivals (Listed as 0) | |
Arrival | ARR_DEL15 | Arrival Delay Greater Than 15 Minutes (0: Not Greater Than 15, 1: Greater Than 15) | {2} |
Arrival | ARR_DELAY_GROUP | Arrival Delay in Number of 15-minute increments Rounded Down (e.g. Early Arrival (< 0) is a value of -1, 30 or 42 minutes is a value of 2) | {2} |
Arrival | ARR_TIME_BLK | Scheduled Arrival Time in Hourly Block (HHMM) | {1} |
Cancellation | CANCELLED | 0: Flight Not Cancelled, 1: Flight Cancelled | {1} {2} |
Cancellation | CANCELLATION_CODE | Reason for Cancellation - if Cancelled, Letter Present (A: Carrier, B: Weather, C: National Aviation System, D: Security) | |
On flight | CRS_ELAPSED_TIME | Scheduled Total Flight Time (in Minutes) | {1} {2} |
On flight | ACTUAL_ELAPSED_TIME | Actual Total Elapsed Flight Time (in Minutes) | {2} |
On flight | AIR_TIME | Actual Total Elapsed Time Airplane in the Air (in Minutes) | |
On flight | DISTANCE | Distance Between Departure and Arrival Airports (in Miles) | {1} {2} |
On flight | DISTANCE_GROUP | Distance Between Departure and Arrival Airports in Number of 250-Mile increments Rounded Down (e.g. 400 miles is a value of 1) | {1} {2} |
Delay | CARRIER_DELAY | Carrier Delay (in Minutes) | |
Delay | WEATHER_DELAY | Weather Delay (in Minutes) | |
Delay | NAS_DELAY | National Aviation System Delay (in Minutes) | |
Delay | SECURITY_DELAY | Security Delay (in Minutes) | |
Delay | LATE_AIRCRAFT_DELAY | Late Aircraft Delay (in Minutes) | |
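As a rough illustration of the second analysis step (computing and plotting the feature correlations), here is a minimal sketch using pandas, seaborn, and matplotlib. It is not the exact code from our notebooks, and the CSV file name is a placeholder for the Kaggle data file.

```python
# Minimal sketch of the feature-correlation analysis; not the exact notebook code.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("covid19_flights.csv")  # placeholder name for the Kaggle CSV

# Correlations are only defined for numeric columns.
corr = df.select_dtypes(include="number").corr()

plt.figure(figsize=(16, 12))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Feature correlations")
plt.tight_layout()
plt.show()
```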
We first implemented what we read about in the TPOT paper, the CPU-based TPOT [1]; the notebook can be found in this repo as `CISC849_TPOT_CPU.ipynb`. Our machine has an Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz.
For feature selection, we kept the features listed in the table above (marked {1}). We encoded all categorical features into numerical values using label encoding. The data was split 75%/25% into training and testing sets. We ran TPOT with the following parameters: `tpot = TPOTClassifier(generations=5, population_size=40, verbosity=1, random_state=42, n_jobs=-1, warm_start=True, max_time_mins=60)`.
After 6 hours of running, TPOT selected a pipeline of `KNeighborsClassifier(n_neighbors=12, p=1, weights='distance')` and achieved 91.89% accuracy. The exported Python script can be viewed in `tpot_airflight_pipeline_CPU.py`.
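For reference, a minimal sketch of how such a run can be set up is shown below. It is not the exact code from `CISC849_TPOT_CPU.ipynb`; the CSV file name is a placeholder, and the feature list follows the {1} column of the table above.

```python
# Minimal sketch of the CPU TPOT run; not the exact code from CISC849_TPOT_CPU.ipynb.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tpot import TPOTClassifier

df = pd.read_csv("covid19_flights.csv")  # placeholder name for the Kaggle CSV

# Features marked {1} in the table above; CANCELLED is the prediction target.
features = ["QUARTER", "MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "MKT_UNIQUE_CARRIER",
            "MKT_CARRIER_FL_NUM", "ORIGIN", "DEST", "CRS_DEP_TIME", "DEP_TIME_BLK",
            "CRS_ARR_TIME", "ARR_TIME_BLK", "CRS_ELAPSED_TIME", "DISTANCE",
            "DISTANCE_GROUP"]
X = df[features].copy()
y = df["CANCELLED"]

# Label-encode categorical (string) columns into integers.
for col in X.select_dtypes(include="object").columns:
    X[col] = LabelEncoder().fit_transform(X[col].astype(str))

# 75%/25% train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75,
                                                    random_state=42)

tpot = TPOTClassifier(generations=5, population_size=40, verbosity=1,
                      random_state=42, n_jobs=-1, warm_start=True,
                      max_time_mins=60)
tpot.fit(X_train, y_train)
print("Test accuracy:", tpot.score(X_test, y_test))
tpot.export("tpot_airflight_pipeline_CPU.py")
```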
We also found an article called Faster AutoML with TPOT and RAPIDS [4]. In this article, the author describes how TPOT can be accelerated by a GPU to achieve better performance in less time. We then implemented GPU-accelerated TPOT on Google Colab; the GPU in Colab was a Tesla T4.
We also kept a different set of features than for the CPU-TPOT run (marked {2}); see the table above for details. After one hour of running, we achieved 92.79% accuracy.
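A minimal sketch of the GPU configuration is shown below. It is not the exact code from `CISC849_TPOT_GPU.ipynb`; it relies on TPOT's "TPOT cuML" configuration (which requires an NVIDIA GPU with RAPIDS cuML installed) and assumes the data preparation from the CPU sketch above, but with the features marked {2} in the table.

```python
# Minimal sketch of GPU-accelerated TPOT via RAPIDS cuML, as described in [4];
# not the exact notebook code. Requires an NVIDIA GPU (e.g. the Colab Tesla T4)
# with RAPIDS cuML installed. X_train, y_train, X_test, y_test are assumed to be
# prepared as in the CPU sketch above, using the {2} feature set.
from tpot import TPOTClassifier

tpot_gpu = TPOTClassifier(generations=5, population_size=40, verbosity=2,
                          random_state=42, max_time_mins=60,
                          config_dict="TPOT cuML")  # GPU-backed estimator search space
tpot_gpu.fit(X_train, y_train)
print("Test accuracy:", tpot_gpu.score(X_test, y_test))
```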
According to Buda et al.'s (2018) comprehensive review of the class imbalance problem [5], class imbalance can make predictions inaccurate. In this dataset, cancelled flights make up only ~10% of the records, so the data is imbalanced. Thus, we considered undersampling to balance the data. Additionally, we tried one-hot encoding for feature engineering. The code can be viewed in `CISC849_TPOT_GPU.ipynb`.
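A minimal sketch of these two ideas is shown below; the feature subset and file name are illustrative, not the exact notebook code.

```python
# Minimal sketch of majority-class undersampling and one-hot encoding;
# not the exact code from CISC849_TPOT_GPU.ipynb.
import pandas as pd

df = pd.read_csv("covid19_flights.csv")  # placeholder name for the Kaggle CSV

# Undersampling: cancelled flights are only ~10% of the data, so randomly sample
# the majority class (not cancelled) down to the size of the minority class.
cancelled = df[df["CANCELLED"] == 1]
not_cancelled = df[df["CANCELLED"] == 0].sample(n=len(cancelled), random_state=42)
balanced = pd.concat([cancelled, not_cancelled]).sample(frac=1, random_state=42)

# One-hot encode categorical features instead of label encoding.
features = ["MONTH", "DAY_OF_WEEK", "MKT_UNIQUE_CARRIER", "ORIGIN", "DEST"]  # illustrative subset
X = pd.get_dummies(balanced[features],
                   columns=["MKT_UNIQUE_CARRIER", "ORIGIN", "DEST"])
y = balanced["CANCELLED"]
```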
Within the dataset, the CANCELLED feature is binary (0 represents a flight that was not cancelled and 1 represents a flight that was cancelled). Because of this, one of us (Matt) decided to use machine learning algorithms that are well suited to supervised binary classification problems.
For the core ML algorithms, we used `scikit-learn`. For plotting figures, we used `seaborn` and `matplotlib`. For dataset manipulation, we used `pandas` to transform the dataset into a DataFrame.
- Naive Bayes Classifier -- Within Naive Bayes, there is a threshold (50%) on how likely an instance is to be classified as a given value (0 or 1) in a binary classification problem. We chose to use Naive Bayes because we believe it might mirror how flights are cancelled in reality. Looking at all factors (weather, arrival times, departure times, etc.), if an airline team determines there is a greater than 50% chance of cancelling the flight, it probably will get cancelled. We hoped Naive Bayes could emulate this decision-making process.
- Decision Tree Classifier -- For the Decision Tree Classifier we used Gini impurity. Gini impurity measures the probability of a particular variable being wrongly classified when it is chosen randomly [3]; for class probabilities p_i, it is 1 - sum(p_i^2). We chose to use a decision tree because it seems like a very logical way to decide whether to cancel a flight. For example, if it is snowing out, the tree would follow a certain path that could not be followed if it was not snowing. Every path leads to a decision: cancel the flight or not.
- K-Nearest Neighbors Classifier -- We decided to use the KNN classifier because it also seems intuitive. If three flights with very similar attributes are all classified as cancelled and a fourth flight comes along with similar attributes to those three flights, there is a high probability that flight will also be classified as cancelled. We also think this might mirror how flight cancellations are made in real life. For example, if Southwest Airlines, Hawaiian Airlines, and JetBlue cancel their flights from Denver, CO to Honolulu, HI because of snow, American Airlines will most likely do the same.
- Random Forest Classifier -- Similar to the Decision Tree Classifier, we used the Random Forest Classifier as a "more powerful" decision tree. A Random Forest Classifier is an ensemble of decision trees, each trained on a random subset of the data and features, with the final classification made by majority vote across the trees.
To start, we wanted to use the default classifiers scikit-learn offers. We figured that, based on the results from the default configurations, we could then tune the hyperparameters to increase our accuracy as needed. We divided up the dataset in two ways. First, we did a regular train-test split, dividing our data into a training set and a testing set using scikit-learn's `train_test_split` function. Second, we used k-fold cross-validation (with k=5 and k=10). We decided to test both ways because the dataset is extremely unbalanced: as a whole, it contains only 282,926 cancelled flights, or just over 10% of the data given to us! By using k-fold cross-validation, we can verify whether the accuracy we achieve from the train-test split holds up.
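Below is a minimal sketch of this comparison; X and y are assumed to be the encoded features and the CANCELLED target prepared as in the earlier sketches, and the exact settings in our notebook may differ.

```python
# Minimal sketch of the manual-model comparison: the four default scikit-learn
# classifiers evaluated with a train/test split and with 5-fold cross-validation.
# X, y are assumed to be prepared as in the earlier sketches.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),      # default criterion is Gini impurity
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    split_acc = clf.score(X_test, y_test)             # train/test-split accuracy
    cv_acc = cross_val_score(clf, X, y, cv=5).mean()  # 5-fold cross-validation accuracy
    print(f"{name}: split accuracy={split_acc:.4f}, 5-fold CV accuracy={cv_acc:.4f}")
```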
Using the train-test split, we achieved the following results with only the default classifiers:
ML Method | Accuracy |
---|---|
Naive Bayes Classifier | 89.35% |
Decision Tree Classifier | 91.61% |
K-Nearest Neighbor Classifier | 90.94% |
Random Forest Classifier | 93.30% |
Using k-fold cross-validation (k=5 and k=10), we achieved the following results with only the default classifiers:
ML Method | k=5 Accuracy | k=10 Accuracy |
---|---|---|
Naive Bayes Classifier | 73.63% | 86.69% |
Decision Tree Classifier | 68.63% | 67.44% |
K-Nearest Neighbor Classifier | 73.63% | 86.69% |
Random Forest Classifier | 76.25% | 75.40% |
Our project had three goals:
1. Accurately predict airline flight cancellations during the COVID-19 pandemic (Jan. 2020 - Jun. 2020).
2. Become familiar with TPOT and see if TPOT can give a good machine learning pipeline that will help us achieve Goal 1.
3. Have one group member not use TPOT, not look at the TPOT results, and see if they can create a better machine learning pipeline than TPOT.
   - It is worth noting that this goal is measured in terms of accuracy on the data.
We were able to achieve all three of these goals in the following ways. For Goal #1, we demonstrated that, both by using TPOT and by manually creating a machine learning pipeline, we were able to build models that can predict airline flight cancellations during the COVID-19 pandemic. While the accuracy of our results varies, we demonstrated that our models perform much better than a random guess. Second, we achieved Goal #2 by thoroughly exploring TPOT. Eric was able to get TPOT running on the CPU within Google Colab and on his local machine. In addition to using TPOT on the CPU, he also figured out how to run TPOT on a GPU using the cuML library from RAPIDS [6]. Using TPOT on the GPU sped up the time taken to create an accurate predictive model, and it also selected a different pipeline than the CPU run of TPOT. Finally, we demonstrated that Matt created a machine learning pipeline that scored higher accuracy than the pipeline TPOT generated (Goal #3). Coming together after each of us completed our parts offered us great insight into how to solve the problem at hand in different ways.
From this project, we learned how to use TPOT on both the CPU and GPU and how it creates effective machine learning pipelines. We also learned how to perform effective feature engineering on a dataset; if we had more time, we would have liked to continue searching for the most useful features. Similarly, by truly understanding the data in the dataset, we learned to filter out which algorithms would be useful for this problem and which would not. For the future, we are interested in exploring whether our model can generalize to future pandemics (and not just the COVID-19 pandemic). We are also interested in seeing if we can generate more accurate results by further exploring the use of a one-hot encoder on the dataset, performing additional feature engineering and pre-processing, and creating a method to make the dataset more balanced between cancelled and non-cancelled flights.
[1] Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492.
[2] https://www.kaggle.com/akulbahl/covid19-airline-flight-delays-and-cancellations
[3] https://blog.quantinsti.com/gini-index/
[4] https://medium.com/rapids-ai/faster-automl-with-tpot-and-rapids-758455cd89e5
[5] Buda, M., Maki, A., & Mazurowski, M. A. (2018). A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106, 249-259.
[6] RAPIDS cuML: https://github.com/rapidsai/cuml