Airline-Passenger-Referral-Prediction

Developed a natural language processing (NLP) model to extract sentiment from customer reviews using text pre-processing (tokenization, stemming, lemmatization, TF-IDF vectorizer). Logistic Regression, Gaussian NB, and Gradient Boosting were used, bringing accuracy up to 92%.

Capstone Project (Classification): predict whether a passenger will recommend the airline to their friends.

Problem statement:

The data includes airline reviews from 2006 to 2019 for popular airlines around the world, with multiple-choice and free-text questions. The data was scraped in spring 2019. The main objective is to predict whether passengers will refer the airline to their friends.

The features are briefly described as follows:

airline: Name of the airline.

overall: Overall score given to the trip, from 1 to 10.

author: Author of the review

reviewdate: Date of the review

customer review: Customer's review in free-text format

aircraft: Type of the aircraft

travellertype: Type of traveler (e.g. business, leisure)

cabin: Cabin at the flight

date flown: Flight date

seatcomfort: Rated between 1-5

cabin service: Rated between 1-5

foodbev: Rated between 1-5

entertainment: Rated between 1-5

groundservice: Rated between 1-5

valueformoney: Rated between 1-5

recommended: Binary, target variable.

Dataset

[Image: dataset]

EDA

Exploratory data analysis (EDA) is a detailed analysis designed to reveal a dataset's underlying structure. It matters to a business because it identifies trends, patterns, and relationships that are not immediately obvious.

Univariate Analysis:

Numerical Features:

[Images: plots of the numerical features]

Outlier Detection

[Image: outlier detection plot]

The plot shows no outliers in the dataset.

Categorical Features:

Spirit Airlines is the most frequent airline in the dataset, followed by American Airlines and United Airlines:

[image]

The Airbus A320 is the most frequent aircraft in the dataset, followed by the Boeing 777 and the Airbus A380.

[image]

Bangkok to Hong Kong is the most frequent route in the dataset, followed by Bangkok to London and London to New York.

[image]

July is the month with the most travel, and December is the second most popular.

[image]

The three plots below show that most travellers in the traveller type column are Solo Leisure, that Economy is the most common class in the cabin column, and that the recommended column shows only a slight difference between recommended and not recommended.

[image] [image] [image]

Bivariate Analysis:

All traveller types strongly prefer economy class. Some Business and Couple Leisure travellers choose business class. First class is the least preferred across all traveller types.

[image]

Multivariate Analysis:

Airlines' recent 5-year trend:

[image]

Multicollinearity:

Many of the rating variables are strongly correlated with the overall rating column. We can therefore drop the other correlated columns and focus on the overall column to simplify the analysis.

[image]

Natural Language Processing:

Text Cleaning:

Text cleaning is the process of preparing raw text for natural language processing (NLP) so that machines can work with human language. The following approach is used to clean the text of customer reviews:

  • Use pos_tag with nltk: POS tagging in NLTK marks each word in the text with its part of speech, based on its definition and context.
  • Remove all characters other than a-z and A-Z.
  • Convert words to lowercase and split them on spaces.
  • Remove stopwords using the nltk library.
  • Lemmatize the reviews with WordNetLemmatizer to get the meaningful words.
  • Join back the words that were split before.
  • Initiate the tokenization process.
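The cleaning steps above can be sketched in plain Python as follows. This is a simplified illustration rather than the notebook's code: `clean_review` and the tiny `STOPWORDS` set are hypothetical stand-ins (the project uses nltk's stopword list), POS tagging is omitted, and lemmatization is a pluggable `lemmatize` callable that defaults to the identity function so the sketch runs without NLTK data downloads.

```python
import re
from typing import Callable, Iterable, List

# Tiny illustrative stopword set; the project uses nltk's English stopword list.
STOPWORDS = frozenset(
    {"a", "an", "and", "the", "is", "was", "were", "to", "of", "in", "it", "i", "on", "for"}
)


def clean_review(
    text: str,
    stopwords: Iterable[str] = STOPWORDS,
    lemmatize: Callable[[str], str] = lambda word: word,
) -> List[str]:
    """Return the cleaned tokens of one review."""
    stopword_set = set(stopwords)
    # Keep only a-z and A-Z; every other character becomes a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase and split on whitespace.
    words = letters_only.lower().split()
    # Drop stopwords, then lemmatize what remains.
    kept = [lemmatize(w) for w in words if w not in stopword_set]
    # Join the words back into one string, then tokenize (here: whitespace split).
    cleaned = " ".join(kept)
    return cleaned.split()


tokens = clean_review("The seats were GREAT, and the crew was friendly!!! 10/10")
print(tokens)  # ['seats', 'great', 'crew', 'friendly']
```

With NLTK installed and the WordNet data downloaded, `WordNetLemmatizer().lemmatize` could be passed as the `lemmatize` argument in place of the default.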

Most Frequent words in customer review column:

[Image: most frequent words in the customer review column]

Model performance:

[Image: model performance]

Confusion matrix of test data

[Image: confusion matrix of test data]
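For readers unfamiliar with the confusion matrix, the sketch below shows how its four cells and the usual derived metrics are computed for a binary target such as recommended. The helper `confusion_counts` is a hypothetical name and the labels are made up for illustration; they are not the project's test results.

```python
from collections import Counter
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tn, fp, fn, tp) for binary labels where 1 = recommended."""
    pairs = Counter(zip(y_true, y_pred))
    return pairs[(0, 0)], pairs[(0, 1)], pairs[(1, 0)], pairs[(1, 1)]


# Made-up labels for illustration only; not the project's test results.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of predicted "recommended" that were correct
recall = tp / (tp + fn)     # share of actual "recommended" that were found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```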

AUC-ROC Curve:

[Image: AUC-ROC curve]

Model performance based on randomly created reviews:

[Image: model performance on randomly created reviews]

Finally, interpreting the model through SHAP:

[Image: SHAP interpretation of the model]

Conclusion:

Logistic Regression performed best of all the algorithms tried on this dataset.