We used web scraping to gather customer reviews of Air India from the Airline Quality website. The extracted data (customer comments, ratings, and related information) was compiled into the "Reviews Dataset" for further analysis, such as predicting customer buying behavior or understanding sentiment toward Air India's services.
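A minimal sketch of how such a scraper might look, using Requests and BeautifulSoup from the library list. The URL pattern, the query parameters, and the `text_content` CSS class are assumptions about the site's layout rather than details confirmed by this project, so check them against the live pages before relying on them.

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL pattern and CSS class; verify both against the live site.
BASE_URL = "https://www.airlinequality.com/airline-reviews/air-india"
REVIEW_CLASS = "text_content"


def parse_reviews(html: str) -> list[str]:
    """Return the text of every review body found in one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        div.get_text(strip=True)
        for div in soup.find_all("div", class_=REVIEW_CLASS)
    ]


def scrape_reviews(pages: int = 1, page_size: int = 100) -> list[str]:
    """Download `pages` listing pages and collect their review texts."""
    reviews: list[str] = []
    for page in range(1, pages + 1):
        response = requests.get(
            f"{BASE_URL}/page/{page}/",
            # Hypothetical query parameters for sort order and page size.
            params={"sortby": "post_date:Desc", "pagesize": page_size},
            timeout=30,
        )
        response.raise_for_status()
        reviews.extend(parse_reviews(response.text))
    return reviews
```

Keeping the parsing in its own function means it can be checked against a saved HTML page without any network access.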
Data preprocessing involves cleaning, transforming, and integrating data to enhance its quality and suitability for analysis.
- Removed the text before the '|' separator in each review.
- Removed special characters from the review text.
- Tokenized the text into meaningful units (tokens).
- Paired each token with its part-of-speech (POS) tag as a tuple, then reduced the tagged tokens to their base forms through lemmatization.
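The cleaning steps above can be sketched with the standard library alone. The function names are hypothetical, the regular expression treats digits and punctuation as "special characters" (an assumption), and the whitespace tokenizer is a simple stand-in for NLTK's tokenizer, not the project's actual code.

```python
import re


def clean_review(text: str) -> str:
    """Drop everything up to the first '|' and strip non-letter characters."""
    # Keep only the part after the first '|'; leave the text alone if there is none.
    _, sep, body = text.partition("|")
    if not sep:
        body = text
    # Replace anything that is not a letter or whitespace with a space.
    body = re.sub(r"[^A-Za-z\s]", " ", body)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", body).strip()


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer, a simple stand-in for nltk.word_tokenize."""
    return text.lower().split()
```

On a pandas column this would be applied with something like `df["reviews"].apply(clean_review)`, where the column name is again an assumption. POS tagging and lemmatization would then follow with `nltk.pos_tag` and `nltk.stem.WordNetLemmatizer`, both of which need the corresponding NLTK data packages downloaded first.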
Sentiment analysis examines digital text to determine its emotional tone (positive, negative, or neutral).
- VADER (Valence Aware Dictionary and sEntiment Reasoner) provided sentiment scores based on the words used.
- It is a lexicon- and rule-based sentiment analyzer that rates terms as positive, negative, or neutral.
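As an illustration, VADER's `polarity_scores` returns a dictionary whose `compound` value lies between -1 and 1, and that value can be mapped to a label. The ±0.05 cut-offs below are the commonly used convention, not thresholds stated by this project, and both function names are hypothetical.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def vader_label(text: str, analyzer=None) -> str:
    """Score `text` with VADER and return its sentiment label."""
    if analyzer is None:
        # Imported lazily so the thresholding helper works without the package.
        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

        analyzer = SentimentIntensityAnalyzer()
    return label_from_compound(analyzer.polarity_scores(text)["compound"])
```

When labeling many reviews, create one `SentimentIntensityAnalyzer` and pass it in, so the lexicon is loaded only once.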
Data visualization uses graphics to represent complex data relationships.
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations.
A word cloud visually represents word frequency in a text: the larger a word appears, the more often it occurs.
- Explored the data to understand it, gain insights, and identify patterns.
- Used the Chardet library to detect the character encoding of the CSV file (UTF-8) before loading it, then checked the data for null values.
- Mutual information (MI) scores show how relevant each feature is to the target variable, which aids feature selection.
- scikit-learn (sklearn) was used to compute the MI score between each feature and the target.
- Divided the dataset into training and test sets for building and evaluating machine learning models.
- Used Min-Max Scaling for consistent scaling across features.
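The mutual information, splitting, and scaling steps above can be sketched with scikit-learn as follows. The data is synthetic stand-in data, and the 80/20 split and random seeds are assumptions, since the actual dataset columns and split ratio are not given here.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: column 0 drives the label, column 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

# Mutual information between each feature and the target (higher = more relevant).
mi_scores = mutual_info_classif(X, y, random_state=0)

# Hold out 20% of the rows for evaluation (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits,
# so that no information from the test set leaks into preprocessing.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Min-Max scaling maps each training feature to the range [0, 1]; test values can fall slightly outside that range because the scaler never sees them during fitting.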
Random Forest and XGBoost are ensemble learning methods that provide accurate and robust models.
Model performance was assessed on unseen (test) data.
The Random Forest classifier trained on the top 6 features achieved slightly higher accuracy than XGBoost, making it the preferred model for predicting the target variable, such as customer satisfaction.
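A hedged sketch of that evaluation: rank features by mutual information on the training split, keep the top 6, fit a Random Forest, and measure accuracy on the held-out rows. The data is synthetic, and both the use of MI to pick the top features and the hyperparameters are assumptions rather than the project's confirmed setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

TOP_K = 6  # number of features kept, as in the text above

# Synthetic stand-in data: 10 features, only the first two determine the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Rank features on the training split only and keep the TOP_K highest scores.
mi = mutual_info_classif(X_train, y_train, random_state=0)
top_idx = np.argsort(mi)[::-1][:TOP_K]

# Train on the selected columns and evaluate on the unseen test rows.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train[:, top_idx], y_train)
accuracy = accuracy_score(y_test, model.predict(X_test[:, top_idx]))
```

An XGBoost model (for example `xgboost.XGBClassifier`) could be fitted and scored on the same selected columns to reproduce the comparison; it is left out here so the sketch depends only on scikit-learn.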
- BeautifulSoup (bs4)
- Chardet
- Matplotlib
- Natural Language Toolkit (nltk)
- NumPy (np)
- Pandas (pd)
- Requests (requests)
- Seaborn (sns)
- Scikit-learn (sklearn)
- VaderSentiment (SentimentIntensityAnalyzer)
- Warnings
- WordCloud