Check out the AWS Sentiment Streamlit App
This repository contains a project for sentiment analysis on Amazon Fashion customer reviews using various machine learning models. The project is implemented using Python and includes a Streamlit web application for model deployment and testing.
The main goal of this project is to analyze customer reviews from the Amazon Fashion dataset to determine sentiment. The project involves several steps, including data collection, data cleaning, feature engineering, model training, and deployment of the model using a Streamlit application.
- MongoDB
- Data Cleaning & Processing
- Vectorizer
- Topic Modeling
- Training Data
- Model Evaluation & Selection
- Streamlit App
- The dataset is created by merging the `raw_review_Amazon_Fashion` data with the `raw_meta_Amazon_Fashion` data from AWS Public Datasets.
- Reviews data can be fetched from MongoDB, while meta data is processed directly from the URL due to size limitations.
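The merge step can be sketched roughly as follows. This is a minimal illustration with toy data, assuming pandas and a shared product key; the column names (`parent_asin`, `rating`, `text`, `title`) are assumptions for the example and may differ from the actual schema.

```python
import pandas as pd

# Toy stand-ins for the two sources; all column names here are assumptions.
reviews = pd.DataFrame({
    "parent_asin": ["A1", "A1", "B2"],
    "rating": [5.0, 2.0, 4.0],
    "text": ["Love this dress", "Poor stitching", "Nice fit"],
})
meta = pd.DataFrame({
    "parent_asin": ["A1", "B2", "C3"],
    "title": ["Summer Dress", "Denim Jacket", "Wool Scarf"],
})

# A left join keeps every review and attaches product metadata where it exists.
merged = reviews.merge(meta, on="parent_asin", how="left")
print(merged.shape)
print(merged[["rating", "title"]])
```

A left join is used here so that no review is dropped when its product has no metadata row; an inner join would be the choice if unmatched reviews should be discarded instead.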
- Select columns to be included in the model.
- Remove rows with null values.
- Classify reviews by rating into three groups:
  - Ratings 4 and 5: Positive (1)
  - Rating 3: Neutral (2)
  - Ratings 1 and 2: Negative (0)
- Stop Words Removal: Convert all text to lowercase and remove stop words.
- Regex Cleaning: Remove all numbers and non-alphabet characters.
- Correct Spelling: Fix spelling errors.
- POS Tagging: Tag parts of speech for lemmatization.
- Lemmatization: Reduce words to their base forms.
- Get Sentiment: Use the TextBlob library to include polarity and subjectivity scores in the model.
- Oversampling methods such as SMOTE can be used to handle data imbalance during model training.
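Two of the steps above, the rating-to-label mapping and the lowercase/regex/stop-word cleaning, can be sketched in plain Python. This is only an illustration: the stop-word set is a tiny hypothetical stand-in for a full list, integer star ratings are assumed, and spelling correction, POS tagging, lemmatization, and TextBlob scoring are left out.

```python
import re

# Tiny illustrative stop-word set (an assumption); a real pipeline would use a full list.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "this", "was", "i", "but", "for"}

def rating_to_label(rating: float) -> int:
    """Map an integer star rating to the class ids listed above."""
    if rating >= 4:
        return 1  # positive
    if rating == 3:
        return 2  # neutral
    return 0      # negative

def clean_text(text: str) -> str:
    """Lowercase, replace non-letter characters with spaces, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # removes digits and punctuation
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(rating_to_label(5), rating_to_label(3), rating_to_label(1))
print(clean_text("This dress is GREAT, but it shrank after 2 washes!"))
```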
- TF-IDF Vectorizer: Applied with the following parameters:
  - `stop_words='english'`: Automatically remove common English stop words (e.g., "and", "the", "is").
  - `min_df=0.008`: A term must appear in at least 0.8% of the documents to be considered.
  - `ngram_range=(1,3)`: Create unigrams (single words), bigrams (two-word combinations), and trigrams (three-word combinations).
  - `token_pattern="\\b[a-z][a-z][a-z]+\\b"`: Select words that are at least three letters long. This regex pattern captures words with three or more lowercase letters.
- Topic Modeling: Product titles are classified into three different topics using topic modeling.
- Review Length: The length of the review is included as a feature.
- Create dummy variables for string columns.
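The review-length feature and the dummy variables could look like the sketch below. It assumes pandas, measures length in characters (a word count is an equally plausible reading), and uses hypothetical column names and topic labels.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["Nice fit", "Poor stitching on the hem"],
    "topic": ["topic_0", "topic_2"],  # hypothetical topic labels
})

# Review length in characters (assumption: the source does not say chars vs. words).
df["review_length"] = df["text"].str.len()

# One-hot encode the string column so a model can consume it.
features = pd.get_dummies(df.drop(columns="text"), columns=["topic"])
print(features.columns.tolist())
```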
- Use GridSearchCV for model training.
- Evaluate different models including XGBClassifier, Random Forest, and Naive Bayes.
- The best model is selected based on accuracy and f1-score.
- Model comparison:

| Model | Accuracy | F1-score |
|---|---|---|
| XGBClassifier | 0.80 | 0.50 |
| XGBClassifier_SMOTE | 0.91 | 0.53 |
| Random Forest | 0.79 | 0.44 |
| Naive Bayes | 0.26 | 0.19 |
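A grid search of this kind can be sketched as below. This is not the project's actual configuration: it uses synthetic three-class data, a Random Forest only, a small hypothetical parameter grid, and macro-averaged F1 as the scoring metric (the averaging method is an assumption).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data standing in for the engineered review features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid; the real search space is not given in this README.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the refit best estimator on the held-out split.
pred = search.predict(X_test)
acc = accuracy_score(y_test, pred)
f1 = f1_score(y_test, pred, average="macro")
print(search.best_params_)
print(round(acc, 2), round(f1, 2))
```

Reporting both metrics matters on imbalanced data: a model can reach high accuracy by favoring the majority class while its macro F1 stays low, which is consistent with the gap between the two columns in the comparison above.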
The model is deployed using a Streamlit web application, which can be accessed here.
To run the Streamlit app locally:
streamlit run app3.py