https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction
Context:
This dataset contains an airline passenger satisfaction survey. What factors are highly correlated to a satisfied (or dissatisfied) passenger? Can you predict passenger satisfaction?
Content Gender: Gender of the passengers (Female, Male)
Customer Type: The customer type (Loyal customer, disloyal customer)
Age: The actual age of the passengers
Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)
Flight distance: The flight distance of this journey
Inflight wifi service: Satisfaction level of the inflight wifi service (0:Not Applicable;1-5)
Departure/Arrival time convenient: Satisfaction level of Departure/Arrival time convenient
Ease of Online booking: Satisfaction level of online booking
Gate location: Satisfaction level of Gate location
Food and drink: Satisfaction level of Food and drink
Online boarding: Satisfaction level of online boarding
Seat comfort: Satisfaction level of Seat comfort
Inflight entertainment: Satisfaction level of inflight entertainment
On-board service: Satisfaction level of On-board service
Leg room service: Satisfaction level of Leg room service
Baggage handling: Satisfaction level of baggage handling
Check-in service: Satisfaction level of Check-in service
Inflight service: Satisfaction level of inflight service
Cleanliness: Satisfaction level of Cleanliness
Departure Delay in Minutes: Minutes delayed when departure
Arrival Delay in Minutes: Minutes delayed when Arrival
Satisfaction: Airline satisfaction level(Satisfaction, neutral or dissatisfaction)
First I understand the data and did some EDA on it. Univariate and Bivariate analysis, Cleaned the data and did a Feature Engineering on it, Label Encoding, Upsamping, Removing Outliers. Then select the features that matter the most for the Machine Learning cycle.
It is binary classification problem so I decided to use K-Nearest Neigbors Classifier at first and after choosing the best value of K, auc = 0.9692 Then I used Decision Tree classifier, auc = 0.9795
Decided to use Ensemble Learning and start with Bagging, used Random Forest which gives an auc = 0.9923
Reduce number of features using Feature Importance on Decision Tree, I removed some features that is not that important according to Decision Tree and continue Ensemble Learning with boosting Techniues. Used XGboost classifier with certain hyperparameters and it gives auc = 0.9903
Then I tried Hyperparameter Tuning and used XGboost with Grid Search and cross validation to choose the best hperparameter and train the model on them .. then it gives auc = 0.9894 ! I also used stacking technique (Stacking Classifer) which works by splitting the data into training and validation sets. The base classifers (KNN-Random Forest-XGboost) are then trained on the training set and their predictions are made on the validation set. These predictions are used as input to the meta-classifer (LogisticRegression), which is trained on the validation set labels. Gives an auc = 0.9883
Also I used some Automated Machine Learning techniques like Hyperopt, Optuna, AutoML(AutoGluon).