DA-runverve-task


Data Analysis & Predictive Modeling of Customer Data

Table of Contents

  • Web Scraping for Data Analysis
  • Predictive Modeling on Customer Data
  • Libraries Utilized

Web Scraping for Data Analysis

Web Scraping

We used web scraping to gather customer reviews of Air India from the Airline Quality website. The extracted data, including customer comments, ratings, and other relevant fields, was compiled into the "Reviews Dataset" for further analysis, such as predicting customer buying behavior and understanding sentiment towards Air India's services.
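Below is a minimal scraping sketch using Requests and BeautifulSoup. The URL pattern, the number of pages, and the `text_content` CSS class are assumptions about the Airline Quality page layout, not values taken from the notebook.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical review-page URL with pagination; adjust to the actual site layout.
BASE_URL = "https://www.airlinequality.com/airline-reviews/air-india/page/{}/"

reviews = []
for page in range(1, 4):  # scrape the first few pages as a demo
    resp = requests.get(BASE_URL.format(page), timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # The review-body class name below is an assumption about the page markup.
    for block in soup.find_all("div", class_="text_content"):
        reviews.append(block.get_text(strip=True))

df = pd.DataFrame({"reviews": reviews})
df.to_csv("reviews_dataset.csv", index=False)
```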

Data Preprocessing

Data preprocessing involves cleaning, transforming, and integrating data to enhance its quality and suitability for analysis.

Data Cleaning

  • Removed the text before the '|' character in each review, keeping only the review body.
  • Removed special characters from the review text (see the sketch below).
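A minimal cleaning sketch; it assumes the scraped CSV is named reviews_dataset.csv and the text column is called `reviews`.

```python
import re
import pandas as pd

df = pd.read_csv("reviews_dataset.csv")

# Keep only the part of each review after the '|' separator.
df["reviews"] = df["reviews"].str.split("|").str[-1]

# Remove special characters, keeping letters, digits and whitespace.
df["reviews"] = df["reviews"].apply(lambda s: re.sub(r"[^A-Za-z0-9\s]", " ", str(s)))

# Collapse repeated whitespace left behind by the substitutions.
df["reviews"] = df["reviews"].str.replace(r"\s+", " ", regex=True).str.strip()
```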

Tokenization

  • Text was tokenized into meaningful units (tokens).
  • Tokens were POS-tagged into (token, tag) tuples and then reduced to their base forms through lemmatization (see the sketch below).
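A tokenization sketch using NLTK; the sample sentence is illustrative, and the resource names passed to nltk.download may differ slightly across NLTK versions.

```python
import nltk
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK resources.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to the WordNet POS expected by the lemmatizer."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

text = "The flights were delayed and the crews were unhelpful"
tokens = word_tokenize(text)                 # split text into tokens
tagged = pos_tag(tokens)                     # (token, POS tag) tuples
lemmas = [lemmatizer.lemmatize(tok, to_wordnet_pos(tag)) for tok, tag in tagged]
print(lemmas)
```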

Sentiment Analysis

Analyzing digital text to determine its emotional tone (positive, negative, or neutral).

VADER

  • VADER (Valence Aware Dictionary and sEntiment Reasoner) assigns sentiment scores to text based on the words it contains.
  • It is a rule- and lexicon-based sentiment analyzer that classifies text as positive, negative, or neutral (see the sketch below).
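A minimal VADER sketch; the ±0.05 compound-score thresholds are the conventional cut-offs, not necessarily the ones used in the notebook.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def label_sentiment(text, pos_threshold=0.05, neg_threshold=-0.05):
    """Return 'positive', 'negative' or 'neutral' based on VADER's compound score."""
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= pos_threshold:
        return "positive"
    if compound <= neg_threshold:
        return "negative"
    return "neutral"

print(label_sentiment("The cabin crew was friendly and helpful"))    # positive
print(label_sentiment("Worst airline experience, lost my luggage"))  # negative
```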

Data Visualization

Using graphics to represent complex data relationships.

via Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Matplotlib Visualization
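For example, a sentiment-distribution bar chart; it assumes the DataFrame `df` carries a `sentiment` column produced by the VADER step above.

```python
import matplotlib.pyplot as plt

# Count how many reviews fall into each sentiment class.
counts = df["sentiment"].value_counts()

plt.figure(figsize=(6, 4))
plt.bar(counts.index, counts.values)
plt.title("Distribution of Review Sentiments")
plt.xlabel("Sentiment")
plt.ylabel("Number of Reviews")
plt.tight_layout()
plt.show()
```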

via WordCloud

A word cloud visually represents word frequency in a text, where the size of each word indicates how often it appears.

WordCloud Visualization
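A word-cloud sketch over the cleaned review text, again assuming the `reviews` column from the earlier steps.

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join all cleaned reviews into one string for the word cloud.
text = " ".join(df["reviews"].astype(str))

wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```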

Predictive Modeling on Customer Data

Exploratory Data Analysis

  • Understanding the data, gaining insights, and identifying patterns.
  • Used the Chardet library to detect the CSV file's character encoding (UTF-8) before loading it, then checked for null values (see the sketch below).
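A minimal EDA sketch; the file name customer_data.csv is a placeholder for the actual customer dataset.

```python
import chardet
import pandas as pd

# Detect the file encoding from the first chunk of raw bytes.
with open("customer_data.csv", "rb") as f:
    detected = chardet.detect(f.read(100_000))
print(detected)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}

# Load the CSV with the detected encoding and inspect missing values.
data = pd.read_csv("customer_data.csv", encoding=detected["encoding"])
data.info()
print(data.isnull().sum())
```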

Mutual Information Graphs

  • Mutual information (MI) graphs visualize how relevant each feature is to the target variable, aiding feature selection.
  • scikit-learn (sklearn) was used to compute the MI score between each attribute and the target (see the sketch after the figure).

MI Graphs
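A sketch of the MI computation using scikit-learn's mutual_info_classif, continuing from the EDA sketch above; the target column name `satisfaction` is an assumption.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Assumes 'satisfaction' is the target column and missing values were already handled.
X = data.drop(columns=["satisfaction"])
# Integer-encode object (categorical) columns so MI can be computed.
X = X.apply(lambda col: col.factorize()[0] if col.dtype == "object" else col)
y = data["satisfaction"]

mi_scores = pd.Series(
    mutual_info_classif(X, y, random_state=0),
    index=X.columns,
).sort_values(ascending=False)

# Horizontal bar chart of feature relevance to the target.
mi_scores.plot(kind="barh", figsize=(8, 6), title="Mutual Information Scores")
```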

Test and Train Model

  • Split the dataset into training and test sets for building and evaluating the machine learning models.
  • Applied Min-Max scaling so that all features share a consistent range (see the sketch below).
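A sketch of the split and scaling, continuing from the variables above; the 80/20 split ratio and random seed are illustrative.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# 80/20 split; stratify keeps the class balance of the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then transform both splits.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```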

via Random Forest Classifier & XGBoost Classifier

Ensemble learning methods that provide accurate and robust models.
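A training sketch for both classifiers, continuing from the scaled splits above; the hyperparameters are illustrative defaults, not tuned values from the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# XGBoost expects integer class labels, so encode the target first.
label_enc = LabelEncoder()
y_train_enc = label_enc.fit_transform(y_train)
y_test_enc = label_enc.transform(y_test)

rf_model = RandomForestClassifier(n_estimators=200, random_state=42)
rf_model.fit(X_train_scaled, y_train_enc)

xgb_model = XGBClassifier(n_estimators=200, random_state=42)
xgb_model.fit(X_train_scaled, y_train_enc)
```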

Validate Model

Assessed each model's performance on unseen test data, as sketched below.
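A validation sketch, continuing from the trained models above.

```python
from sklearn.metrics import accuracy_score, classification_report

for name, model in [("Random Forest", rf_model), ("XGBoost", xgb_model)]:
    preds = model.predict(X_test_scaled)
    print(name, "accuracy:", accuracy_score(y_test_enc, preds))
    print(classification_report(y_test_enc, preds))
```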

Conclusion

The Random Forest classifier trained on the top 6 features achieved slightly higher accuracy than the XGBoost classifier, making it the preferred model for predicting customer satisfaction (the target variable).

Libraries Utilized

  • BeautifulSoup (bs4)
  • Chardet
  • Matplotlib
  • Natural Language Toolkit (nltk)
  • NumPy (np)
  • Pandas (pd)
  • Requests
  • Seaborn (sns)
  • Scikit-learn (sklearn)
  • VaderSentiment (SentimentIntensityAnalyzer)
  • Warnings
  • WordCloud
  • XGBoost (xgb)