๐Ÿ›ฉ๏ธ Airline Satisfaction Predictors ๐Ÿš

The airline industry is undeniably massive, with an annual revenue exceeding $800B and 6M travelers per day. A team of four computer engineers from Cairo University set out to find a recipe for the perfect airline company by answering the paramount question: “What makes airline customers satisfied?” The question is posed as both a data analysis problem and a machine learning problem, which together answer it via exploratory analytics, association rule mining and predictive models.

🚀 Pipeline

Our approach to this problem followed the pipeline below:

🛫 Data Preparation

We used Kaggle's Airline satisfaction dataset; a sample is shown below:

| | Age | Flight Distance | Departure Delay in Minutes | Arrival Delay in Minutes | Gender | Customer Type | Type of Travel | Class | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 460 | 25 | 18.0 | Male | Loyal Customer | Personal Travel | Eco Plus | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 |
| 1 | 25 | 235 | 1 | 6.0 | Male | disloyal Customer | Business travel | Business | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 |
| 2 | 26 | 1142 | 0 | 0.0 | Female | Loyal Customer | Business travel | Business | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 |

Our data preparation module was implemented using Pandas and PySpark and supports the following:

  • Reading a specific split of the data (training or validation)
  • Reading specific column types from the data (numerical, ordinal or categorical)
  • Frequency Encoding for categorical features
  • Dropping missing values
  • Imputing numerical outliers

Alternative versions of these functions were also implemented in case any model required further special preprocessing.
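A few of the preparation steps listed above can be sketched in Pandas as follows. This is only a minimal illustration, not the project's actual module: the function names `frequency_encode` and `impute_outliers`, the IQR rule used to flag outliers, the median replacement, and the toy rows are all assumptions made for the example.

```python
import pandas as pd


def frequency_encode(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Replace each category with its relative frequency in that column."""
    out = df.copy()
    for col in columns:
        freqs = out[col].value_counts(normalize=True)
        out[col] = out[col].map(freqs)
    return out


def impute_outliers(df: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.DataFrame:
    """Replace values outside the IQR fences with the column median (assumed rule)."""
    out = df.copy()
    for col in columns:
        values = out[col].astype(float)
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        fence = k * (q3 - q1)
        is_outlier = (values < q1 - fence) | (values > q3 + fence)
        out[col] = values.mask(is_outlier, values.median())
    return out


# Made-up toy rows that reuse column names from the sample table above.
raw = pd.DataFrame({
    "Flight Distance": [460, 235, 1142, 500, 9000, 300],
    "Arrival Delay in Minutes": [18.0, 6.0, 0.0, None, 3.0, 1.0],
    "Class": ["Eco Plus", "Business", "Business", "Eco", "Eco", "Business"],
})

clean = raw.dropna()                                 # drop rows with missing values
clean = frequency_encode(clean, ["Class"])           # "Business" -> 0.6 in this toy data
clean = impute_outliers(clean, ["Flight Distance"])  # 9000 is replaced by the median
```

Each helper returns a new DataFrame rather than modifying its input, so the steps can be chained or skipped per model.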

🎨 Exploratory Data Analytics

Rather than querying the dataset for specific facts, we prioritized letting it tell us all the facts. In other words, we let the data speak for itself, and for that we designed the following analysis workflow:

| Analysis Stage | Components |
|---|---|
| Univariate Analysis | Prior Class Distribution |
| | Basics of Features Involved |
| | Missing Data Analysis |
| | Central Tendency & Spread |
| | Feature Distributions |
| | Feature Distributions per Class |
| Correlations & Associations | Dependence between Categorical Features |
| | Correlations between Categorical & Numerical Features |
| | Monotonic Association between Ordinal Variables |
| | Correlations between Numerical Features |
| | Naive Bayes Assumption |
| Multivariate Analysis | Separability & Distribution of Numerical Features |
| | Separability & Distribution of Numerical Feature Pairs |
| | Separability & Distribution of Numerical Feature Trios |
| | Separability & Distribution of Numerical and Categorical Pairs |
| | Separability & Distribution of Categorical Pairs |

In the following, we take you through a cursory glance at the workflow. The set of insights that each visual corresponds to (over 50 insights in total) can be found in the demonstration notebook or the report, along with full versions of the visuals and tables.

🪂 Univariate Analysis

◉ Prior Class Distribution

Here we study whether there is any imbalance among the satisfaction levels of the people in the dataset.


◉ Basics of Each Feature

This provides a description, type and possible values for each feature.

| Variable Name | Variable Description | Variable Type | Values |
|---|---|---|---|
| Gender | Gender of the passengers | Nominal | Female, Male |
| Customer Type | The customer type | Nominal | Loyal customer, Disloyal customer |
| Age | The actual age of the passengers | Numerical | - |
| Type of Travel | Purpose of the flight of the passengers | Nominal | Personal Travel, Business Travel |
| Class | Travel class in the plane of the passengers | Nominal | Business, Eco, Eco Plus |
| Flight Distance | The flight distance of this journey | Numerical | - |
| Inflight wifi service | Satisfaction level of the inflight wifi service | Ordinal | 1, 2, 3, 4, 5 |
| Departure/Arrival time convenient | Satisfaction level of Departure/Arrival time convenient | Ordinal | 1, 2, 3, 4, 5 |
| Ease of Online booking | Satisfaction level of online booking | Ordinal | 1, 2, 3, 4, 5 |
| Gate location | Satisfaction level of Gate location | Ordinal | 1, 2, 3, 4, 5 |
| Food and drink | Satisfaction level of Food and drink | Ordinal | 1, 2, 3, 4, 5 |
| Online boarding | Satisfaction level of online boarding | Ordinal | 1, 2, 3, 4, 5 |
| Seat comfort | Satisfaction level of Seat comfort | Ordinal | 1, 2, 3, 4, 5 |
| Inflight entertainment | Satisfaction level of inflight entertainment | Ordinal | 1, 2, 3, 4, 5 |
| On-board service | Satisfaction level of On-board service | Ordinal | 1, 2, 3, 4, 5 |
| Leg room service | Satisfaction level of Leg room service | Ordinal | 1, 2, 3, 4, 5 |
| Baggage handling | Satisfaction level of baggage handling | Ordinal | 1, 2, 3, 4, 5 |
| Check-in service | Satisfaction level of Check-in service | Ordinal | 1, 2, 3, 4, 5 |
| Inflight service | Satisfaction level of inflight service | Ordinal | 1, 2, 3, 4, 5 |
| Cleanliness | Satisfaction level of Cleanliness | Ordinal | 1, 2, 3, 4, 5 |
| Departure Delay in Minutes | Minutes of delay at departure | Numerical | - |
| Arrival Delay in Minutes | Minutes of delay at arrival | Numerical | - |
| Satisfaction | Airline satisfaction level | Nominal | Satisfaction, Neutral, Dissatisfaction |

◉ Missing Data Analysis

In this, we analyze each feature for missing data.

| | Age | Flight Distance | Departure Delay in Minutes | Arrival Delay in Minutes | Gender | Customer Type | Type of Travel | Class | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Missing Count | 0 | 0 | 0 | 310 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
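A per-feature missing count like the one above can be obtained in Pandas with a single call. The snippet below is a minimal sketch on made-up rows, not the project's own code.

```python
import pandas as pd

# Made-up toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "Age": [13, 25, 26],
    "Arrival Delay in Minutes": [18.0, None, 0.0],
})

# Number of missing entries per column.
missing_count = toy.isna().sum()
print(missing_count)
```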

◉ Central Tendency and Spread

For each type of feature, we consider measures of central tendency and spread. This is an example for numerical features.

| | Age | Flight Distance | Departure Delay in Minutes | Arrival Delay in Minutes |
|---|---|---|---|---|
| count | 103904.000000 | 103904.000000 | 103904.000000 | 103594.000000 |
| mean | 39.379706 | 1189.448375 | 14.815618 | 15.178678 |
| std | 15.114964 | 997.147281 | 38.230901 | 38.698682 |
| min | 7.000000 | 31.000000 | 0.000000 | 0.000000 |
| 25% | 27.000000 | 414.000000 | 0.000000 | 0.000000 |
| 50% | 40.000000 | 843.000000 | 0.000000 | 0.000000 |
| 75% | 51.000000 | 1743.000000 | 12.000000 | 13.000000 |
| max | 85.000000 | 4983.000000 | 1592.000000 | 1584.000000 |

◉ Feature Distributions

Here we analyze the distributions of each feature to look for any special patterns.




◉ Feature Distributions per Class

This provides the same analysis as above, but per class.




🪂🪂 Correlations

◉ Dependence between Categorical Features

We have used the Chi-square test of independence which tests if there is a relationship between two categorical variables. In particular, we have that

$H_0$: The two categorical variables are independent.

$H_1$: The two categorical variables are dependent.

Here, we set $\alpha = 0.05$ and hence, if the p-value for the test done on two variables is less than $0.05$, we reject the null hypothesis and conclude that the two variables are dependent.

| | Gender | Customer Type | Type of Travel | Class |
|---|---|---|---|---|
| Gender | 0.0 | 0.0 | 0.026398 | 0.000119 |
| Customer Type | 0.0 | 0.0 | 0.0 | 0.0 |
| Type of Travel | 0.026398 | 0.0 | 0.0 | 0.0 |
| Class | 0.000119 | 0.0 | 0.0 | 0.0 |
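For reference, one p-value of this kind can be computed as sketched below. The sketch assumes SciPy's `chi2_contingency` on a Pandas cross-tabulation; the helper name `chi2_p_value` and the toy data are made up for illustration and are not taken from the project.

```python
import pandas as pd
from scipy.stats import chi2_contingency

ALPHA = 0.05  # significance level used in the text


def chi2_p_value(df: pd.DataFrame, col_a: str, col_b: str) -> float:
    """p-value of the chi-square test of independence between two categorical columns."""
    table = pd.crosstab(df[col_a], df[col_b])  # observed contingency table
    _, p_value, _, _ = chi2_contingency(table)
    return float(p_value)


# Made-up toy data: travel type depends strongly on class, gender does not.
toy = pd.DataFrame({
    "Class": ["Business"] * 20 + ["Eco"] * 20,
    "Type of Travel": (["Business travel"] * 18 + ["Personal Travel"] * 2
                       + ["Business travel"] * 3 + ["Personal Travel"] * 17),
    "Gender": ["Male", "Female"] * 20,
})

p_dependent = chi2_p_value(toy, "Class", "Type of Travel")
p_independent = chi2_p_value(toy, "Class", "Gender")
print(p_dependent < ALPHA, p_independent < ALPHA)
```

Running the helper over every pair of categorical columns would fill a symmetric matrix like the one above.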

◉ Associations between Categorical & Numerical Features

We used Pearson's correlation ratio to find associations between all possible pairs of numerical and categorical features.
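The correlation ratio compares the spread of the per-category means with the total spread of the numerical feature. The following is a minimal sketch under that standard definition; the function name `correlation_ratio` and the toy values are illustrative assumptions, and whether the project reports η or η² is not stated in the text.

```python
import pandas as pd


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Correlation ratio (eta): sqrt(between-group sum of squares / total sum of squares)."""
    overall_mean = values.mean()
    groups = values.groupby(categories)
    between = (groups.size() * (groups.mean() - overall_mean) ** 2).sum()
    total = ((values - overall_mean) ** 2).sum()
    return float((between / total) ** 0.5) if total > 0 else 0.0


# Made-up toy values: distance is fully explained by class, not at all by gender.
travel_class = pd.Series(["Eco", "Eco", "Business", "Business"])
gender = pd.Series(["Male", "Female", "Male", "Female"])
distance = pd.Series([200.0, 200.0, 1500.0, 1500.0])

eta_class = correlation_ratio(travel_class, distance)
eta_gender = correlation_ratio(gender, distance)
print(eta_class, eta_gender)
```

A value near 1 means the category almost determines the numerical feature, while a value near 0 means the category means barely differ.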


◉ Monotonic Association between Ordinal Variables

We used Spearman's rank correlation to find monotonic associations between ordinal variables.


◉ Correlations between Numerical Features

We used Pearson's correlation coefficient to study linear correlation between numerical variables.
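Both the rank-based and the linear correlation matrices can be obtained from Pandas' `DataFrame.corr` by switching the `method` argument. The sketch below uses made-up toy values, with column names borrowed from the dataset, purely to show the calls.

```python
import pandas as pd

# Made-up ordinal ratings (1 to 5) for the Spearman example.
ratings = pd.DataFrame({
    "Seat comfort": [1, 2, 3, 4, 5, 5],
    "Cleanliness": [1, 1, 2, 4, 4, 5],
})

# Made-up delays in minutes for the Pearson example.
delays = pd.DataFrame({
    "Departure Delay in Minutes": [0.0, 5.0, 12.0, 30.0, 60.0],
    "Arrival Delay in Minutes": [0.0, 6.0, 10.0, 33.0, 58.0],
})

spearman = ratings.corr(method="spearman")  # monotonic association between ordinals
pearson = delays.corr(method="pearson")     # linear correlation between numericals
print(spearman)
print(pearson)
```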


◉ Naive Bayes Assumption

Related to dependence is testing the Naive Bayes assumption. We have provided an automated way to do that; a sample output follows:

As expected, the Naive Bayes assumption does not hold. In particular, computing the joint probability numerically from its definition gives $$P(x_1, x_2, \ldots \mid C_1=0)=0.16$$ whereas applying the Naive Bayes assumption gives $$P(x_1 \mid C_1=0)P(x_2 \mid C_1=0)\cdots=0.08$$ which differs from the correct probability.

Likewise, for the class $C_1=1$ the joint probability is $$P(x_1, x_2, \ldots \mid C_1=1)=0.07$$ but the Naive Bayes assumption gives $$P(x_1 \mid C_1=1)P(x_2 \mid C_1=1)\cdots=0.04$$ which again differs from the correct probability.

This does not stop us from using the model, as research has demonstrated that it can be robust to violations of the assumption.
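The comparison above can be illustrated with a small sketch that estimates both quantities by counting. It is not the project's automated check: the helper `joint_vs_naive`, the binary `satisfied` column, and the toy rows are assumptions made for the example.

```python
import pandas as pd


def joint_vs_naive(df, features, target, point, cls):
    """Return (empirical joint P(point | cls), product of per-feature P(x_m | cls))."""
    in_class = df[df[target] == cls]
    all_match = pd.Series(True, index=in_class.index)
    naive = 1.0
    for feature in features:
        hit = in_class[feature] == point[feature]
        all_match &= hit     # rows matching every feature value so far
        naive *= hit.mean()  # marginal P(x_m | cls), estimated by counting
    return float(all_match.mean()), float(naive)


# Made-up toy data in which the two features are strongly dependent within class 0.
toy = pd.DataFrame({
    "Class": ["Eco", "Eco", "Business", "Business", "Eco", "Business"],
    "Type of Travel": ["Personal Travel", "Personal Travel", "Business travel",
                       "Business travel", "Business travel", "Business travel"],
    "satisfied": [0, 0, 0, 0, 1, 1],
})

point = {"Class": "Eco", "Type of Travel": "Personal Travel"}
joint, naive = joint_vs_naive(toy, ["Class", "Type of Travel"], "satisfied", point, cls=0)
print(joint, naive)
```

When the two numbers differ noticeably, as they do here, the features are not conditionally independent given the class.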

🪂🪂🪂 Multivariate Analysis

◉ Separability and Distribution of Numerical Features

We prepared box plots to study the distribution of the numerical features for each class in the target.


◉ Separability and Distribution of Numerical Pairs

We then analyzed all pairs of numerical features to see whether separability improves.


◉ Separability and Distribution of Numerical Trios

We then considered all trios of numerical features.


◉ Separability and Distribution of Numerical and Categorical Pairs

We then examined the interaction of each categorical feature with all possible numerical features. The plot is much longer in the notebook.


◉ Separability and Distribution of Categorical Pairs

Finally, we examined the interaction among all possible pairs of categorical features. Viewing the full version in the notebook requires a machine with enough RAM.


🗿 Model Building & Evaluation

The abundance of categorical features and their significance, as shown above, inspired us to consider Naive Bayes and Random Forest as predictive models. We later followed up with an SVM model, since ordinal features can also be treated as numerical. Before employing these models, however, we topped off our exploratory data analytics with association rule learning.

🔮 Apriori Model

For this, numerical features (excluding extremely skewed ones) were converted to categorical ones by binning and then all the categorical features were one-hot encoded.

The following shows a sample of the strongest rules found:

| | antecedents | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 1 | Class = Eco | satisfaction = neutral or dissatisfied | 0.366146 | 0.813862 | 1.436226 |
| 2 | Type of Travel = Business travel, Customer Type = Loyal Customer | satisfaction = satisfied | 0.358793 | 0.705553 | 1.628201 |
| 3 | Age = adult, Type of Travel = Business travel | satisfaction = satisfied | 0.388281 | 0.598531 | 1.381228 |

The corresponding graphic over all strong rules is shown below.


This is a color plot of support against confidence, where the color represents lift and each number refers to a rule in the table.
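To make the three rule metrics concrete, the sketch below bins a numerical feature, one-hot encodes the result, and computes support, confidence and lift for a single rule by counting. It does not run Apriori itself; the helper `rule_metrics`, the age bin edges and labels, and the toy rows are assumptions made for illustration.

```python
import pandas as pd


def rule_metrics(onehot: pd.DataFrame, antecedent: list[str], consequent: str) -> dict:
    """Support, confidence and lift of the rule antecedent -> consequent."""
    onehot = onehot.astype(bool)
    has_antecedent = onehot[antecedent].all(axis=1)
    has_both = has_antecedent & onehot[consequent]
    support = has_both.mean()                     # P(antecedent and consequent)
    confidence = support / has_antecedent.mean()  # P(consequent | antecedent)
    lift = confidence / onehot[consequent].mean() # confidence relative to P(consequent)
    return {"support": float(support), "confidence": float(confidence), "lift": float(lift)}


# Made-up toy rows; bin edges and labels for Age are hypothetical.
toy = pd.DataFrame({
    "Age": [15, 34, 45, 70, 28, 52],
    "Class": ["Eco", "Business", "Eco", "Eco", "Business", "Eco"],
    "satisfaction": ["neutral or dissatisfied", "satisfied", "neutral or dissatisfied",
                     "satisfied", "satisfied", "neutral or dissatisfied"],
})
toy["Age"] = pd.cut(toy["Age"], bins=[0, 18, 60, 120], labels=["minor", "adult", "senior"])

# One-hot encode every categorical column, e.g. "Class = Eco".
onehot = pd.get_dummies(toy, prefix_sep=" = ")

metrics = rule_metrics(onehot, ["Class = Eco"], "satisfaction = neutral or dissatisfied")
print(metrics)
```

A lift above 1 means the antecedent makes the consequent more likely than its base rate, which is why lift is a natural color scale for the plot described above.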

🎲 Naive Bayes

For Naive Bayes, we started by preprocessing the data: string categories were converted to integers, and the four numerical variables were bucketized by deciles (the 10th, 20th, 30th, ... percentiles) so that they could be treated like categorical variables. We then implemented Naive Bayes on PySpark from scratch:

By Bayes' rule: $$P(C_i \mid A=(a_1,a_2,\ldots,a_M)) = \frac{P(A=(a_1,a_2,\ldots,a_M) \mid C_i)P(C_i)}{P(A=(a_1,a_2,\ldots,a_M))}$$

By the naive (conditional independence) assumption: $$P(A=(a_1,a_2,\ldots,a_M) \mid C_i) = \prod_{m=1}^{M} P(a_m \mid C_i)$$

Since the denominator in Bayes' rule is the same for every class: $$P(C_i \mid A=(a_1,a_2,\ldots,a_M)) \propto P(A=(a_1,a_2,\ldots,a_M) \mid C_i)P(C_i)$$

Hence, the most likely class is given by: $$\hat{C} = \arg\max_{C_i,\ 1 \leq i \leq K} P(C_i) \prod_{m=1}^{M} P(a_m \mid C_i)$$

where $P(C_i)$ and $P(a_m \mid C_i)$ are easily computed by counting.

This yields an accuracy of $89.1\%$ in predicting customer satisfaction.
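The counting-based classifier derived above can be sketched in a few lines of plain Python. This is a single-machine illustration of the same formula, not the project's PySpark implementation: the function names, the absence of smoothing, and the toy rows are assumptions, and numerical features are assumed to be bucketized already.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels):
    """Collect the counts needed for P(C_i) and P(a_m | C_i)."""
    class_counts = Counter(labels)
    # value_counts[(m, c)][a] = number of class-c rows whose m-th feature equals a
    value_counts = defaultdict(Counter)
    for row, c in zip(rows, labels):
        for m, a in enumerate(row):
            value_counts[(m, c)][a] += 1
    return class_counts, value_counts


def predict(model, row):
    """Return the class maximizing P(C_i) * prod_m P(a_m | C_i)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(C_i)
        for m, a in enumerate(row):
            score *= value_counts[(m, c)][a] / n_c  # likelihood P(a_m | C_i)
        scores[c] = score
    return max(scores, key=scores.get)


# Made-up toy rows of (Class, Type of Travel).
rows = [
    ("Eco", "Personal Travel"), ("Eco", "Personal Travel"), ("Eco", "Business travel"),
    ("Business", "Business travel"), ("Business", "Business travel"),
    ("Business", "Personal Travel"),
]
labels = ["neutral or dissatisfied"] * 3 + ["satisfied"] * 3

model = fit_naive_bayes(rows, labels)
prediction_eco = predict(model, ("Eco", "Business travel"))
prediction_business = predict(model, ("Business", "Personal Travel"))
print(prediction_eco, prediction_business)
```

On real data, unseen feature values would zero out a class score, so some form of smoothing is commonly added; whether the project does so is not stated in the text.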

🌲 Random Forest

We also trained a Random Forest model, which did not require any special preprocessing beyond that used for Naive Bayes. Luckily, PySpark’s RandomForest inherently supports both categorical and numerical features. After applying the model, the observed accuracy on the validation set was $96\%$.

We also analyzed the average feature importance assigned by the trees in the forest.
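As an illustration of tree-averaged feature importance, the sketch below uses scikit-learn's `RandomForestClassifier` as a stand-in for the PySpark model used in the project. The data are synthetic and constructed so that the label depends only on the wifi rating, so the output says nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
wifi = rng.integers(1, 6, size=n)           # synthetic ordinal rating in 1..5
distance = rng.integers(100, 5000, size=n)  # synthetic distance, unrelated to the label
X = np.column_stack([wifi, distance])
y = (wifi >= 4).astype(int)                 # synthetic label driven by wifi only

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ averages the impurity-based importance over all trees.
names = ["Inflight wifi service", "Flight Distance"]
importances = dict(zip(names, forest.feature_importances_))
print(importances)
```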


๐Ÿ“ SVM Model

We topped off with a linear SVM model and hyperparameter search but results were not as significant as the random forest.

🛬 Result Interpretation

Taken together, the analyses above led us to conclude the following:

  • There is a lack of satisfaction with the airline travel experience (class imbalance)
  • This lack is concentrated among economy travelers
  • Wifi service, inflight entertainment and online boarding are key determinants of satisfaction
  • Comfort and ease of booking also matter
  • Distance and delays seem to have a less adverse effect

