Capstone Project - Classification: Predicting whether a passenger will recommend an airline to their friends.
The data includes airline reviews from 2006 to 2019 for popular airlines around the world, with multiple-choice and free-text questions. The data was scraped in spring 2019. The main objective is to predict whether passengers will recommend the airline to their friends.
airline: Name of the airline.
overall: Overall score given to the trip, from 1 to 10.
author: Author of the trip
reviewdate: Date of the review
customer review: Review of the customers in free text format
aircraft: Type of the aircraft
travellertype: Type of traveler (e.g. business, leisure)
cabin: Cabin class of the flight
date flown: Flight date
seatcomfort: Rated between 1-5
cabin service: Rated between 1-5
foodbev: Rated between 1-5
entertainment: Rated between 1-5
groundservice: Rated between 1-5
valueformoney: Rated between 1-5
recommended: Binary, target variable.
Exploratory data analysis (EDA) is a detailed examination designed to reveal a data set's underlying structure. It matters to a business because it identifies trends, patterns, and relationships that are not intuitively clear.
No outliers were detected in the dataset.
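The write-up does not say how outliers were checked. One common option is the 1.5 × IQR rule, sketched below on made-up toy ratings. The values, the `seatcomfort` variable name, and the choice of the IQR rule itself are assumptions for illustration, not the project's confirmed method.

```python
from statistics import quantiles

# Made-up toy ratings on the 1-5 scale (assumption, not project data).
seatcomfort = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

# Quartiles: quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, _, q3 = quantiles(seatcomfort, n=4)
iqr = q3 - q1

# 1.5 * IQR fences; values outside them are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [x for x in seatcomfort if x < lower or x > upper]

print(outliers)
```

Because the rating columns are bounded (1-5, or 1-10 for overall), the fences will often sit outside the allowed range, which is consistent with finding no outliers.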
Spirit Airlines appears most often in the dataset, followed by American Airlines and United Airlines.
The most frequent aircraft is the Airbus A320, followed by the Boeing 777 and the Airbus A380.
The most frequent route is Bangkok to Hong Kong, followed by Bangkok to London and London to New York.
July is the month with the most travel, and December is the second most popular.
In the traveller type column, the majority of travellers are Solo Leisure. In the cabin column, Economy is the class most passengers choose. In the recommended column, there is only a slight difference between the recommended and not recommended counts.
Travellers of every type strongly prefer economy class. Some Business and Couple Leisure travellers choose business class. First class is the least preferred across all traveller types.
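Frequency rankings like these can be produced by counting the values of a column. The sketch below uses a made-up toy list in place of the airline column, so the counts are illustrative only and are not the project's figures.

```python
from collections import Counter

# Made-up toy values standing in for the airline column (assumption).
airlines = [
    "Spirit Airlines", "American Airlines", "Spirit Airlines",
    "United Airlines", "American Airlines", "Spirit Airlines",
]

# most_common(3) returns the three most frequent values with their counts.
top_airlines = Counter(airlines).most_common(3)
print(top_airlines)
```

The same pattern would apply to the aircraft, route, and month values.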
Airlines' recent 5-year trend:
#Multicollinearity:
Many of the rating variables are strongly correlated with the overall rating column. We may therefore ignore those correlated columns and focus on the overall column alone to simplify the analysis.
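As a minimal sketch of this check, the code below computes the Pearson correlation of each rating column with `overall` and lists the columns above a cutoff. The toy values, the `noise` column, and the 0.8 threshold are all assumptions made for illustration; the project's actual numbers are not given here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up toy ratings (assumption). "noise" is a hypothetical unrelated column.
ratings = {
    "overall": [1, 3, 5, 8, 10, 2],
    "seatcomfort": [1, 2, 3, 4, 5, 1],
    "valueformoney": [1, 2, 2, 4, 5, 1],
    "noise": [3, 1, 4, 2, 3, 5],
}

THRESHOLD = 0.8  # assumed cutoff for "strongly correlated"

# Columns whose correlation with "overall" exceeds the cutoff in magnitude.
redundant = [
    col for col in ratings
    if col != "overall"
    and abs(pearson(ratings[col], ratings["overall"])) > THRESHOLD
]
print(redundant)
```

With a pandas DataFrame, `df.corr()` would give the full correlation matrix in one call.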
##Text Cleaning:
Text cleaning is the process of preparing raw text for NLP (Natural Language Processing) so that machines can process human language. The following approach is used here to clean the text of customer reviews:
- Use pos_tag with nltk: POS tagging in NLTK marks up each word in a text with its part of speech, based on its definition and context.
- Remove all characters other than a-z and A-Z.
- Convert words to lowercase and split them on spaces.
- Remove stopwords using the nltk library.
- Lemmatize the reviews with WordNetLemmatizer to get the meaningful words.
- Join back the words that were split before.
- Initiate the tokenization process.
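The middle steps of this list (letter filtering, lowercasing, stopword removal, lemmatization, re-joining) can be sketched as below. To keep the example runnable without NLTK data downloads, the stopword set and the lemmatizer are passed in as parameters. `clean_review`, `DEMO_STOPWORDS`, and the identity lemmatizer are hypothetical stand-ins, and POS tagging and the final tokenization are left out.

```python
import re

# Tiny illustrative stopword set (assumption). With NLTK installed, this would
# be set(nltk.corpus.stopwords.words("english")).
DEMO_STOPWORDS = {"the", "a", "an", "was", "and", "is", "to", "of", "in", "it"}


def clean_review(text, stop_words=DEMO_STOPWORDS, lemmatize=lambda w: w):
    """Return a cleaned, space-joined version of one review.

    `lemmatize` is any word -> word function; the default leaves words as-is.
    With NLTK, WordNetLemmatizer().lemmatize could be passed instead.
    """
    # Keep letters only; every other character becomes a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase, then split on whitespace.
    words = letters_only.lower().split()
    # Drop stopwords and lemmatize what remains.
    kept = [lemmatize(w) for w in words if w not in stop_words]
    # Join the words back into a single string.
    return " ".join(kept)


cleaned = clean_review("The flight was delayed 3 hours, and the crew was rude!")
print(cleaned)  # flight delayed hours crew rude
```

Passing the stopword set and lemmatizer as arguments also makes it easy to compare cleaning variants on the same reviews.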
Logistic Regression performed best among all the algorithms tried on this dataset.