AWS Sentiment Streamlit App

Check out the AWS Sentiment Streamlit App

AWS Sentiment Analysis with Amazon Fashion Customer Reviews

This repository contains a project for sentiment analysis on Amazon Fashion customer reviews using various machine learning models. The project is implemented using Python and includes a Streamlit web application for model deployment and testing.

Project Overview
Data Pipeline
Model Training
Model Evaluation
Streamlit Application
Usage

Project Overview

The main goal of this project is to analyze customer reviews from the Amazon Fashion dataset to determine sentiment. The project involves several steps, including data collection, data cleaning, feature engineering, model training, and deployment of the model using a Streamlit application.

Agenda

MongoDB
Data Cleaning & Processing
Vectorizer
Topic Modeling
Training Data
Model Evaluation & Selection
Streamlit App

Data Pipeline

MongoDB

The dataset is created by merging the raw_review_Amazon_Fashion data with the raw_meta_Amazon_Fashion data from AWS Public Datasets.
Review data can be fetched from MongoDB, while the metadata is processed directly from its URL because of size limitations.

Data Cleaning & Processing

Select columns to be included in the model.
Remove rows with null values.
Classify reviews into three groups based on their ratings:
- Ratings 4 and 5: Positive (1)
- Rating 3: Neutral (2)
- Ratings 1 and 2: Negative (0)
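
As a minimal sketch of the rating-to-label mapping above (the function name rating_to_label is hypothetical and not taken from the repository), the three groups could be encoded like this:

```python
def rating_to_label(rating: float) -> int:
    """Map a 1-5 star rating to the sentiment classes listed above."""
    if rating in (4, 5):
        return 1  # positive
    if rating == 3:
        return 2  # neutral
    if rating in (1, 2):
        return 0  # negative
    raise ValueError(f"rating must be a whole number from 1 to 5, got {rating!r}")


labels = [rating_to_label(r) for r in [5, 3, 1, 4, 2]]
print(labels)  # [1, 2, 0, 1, 0]
```

Note that the neutral class is encoded as 2 rather than sitting between 0 and 1, so the label values should be treated as categories, not as an ordered scale.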

Review Processing Steps

Stop Words Removal: Convert all text to lowercase and remove stop words.
Regex Cleaning: Remove all numbers and non-alphabetic characters.
Correct Spelling: Fix spelling errors.
POS Tagging: Tag parts of speech for lemmatization.
Lemmatization: Reduce words to their base forms.
Get Sentiment: Use the TextBlob library to compute polarity and subjectivity scores, which are included as model features.
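
The first two steps can be illustrated with the standard library alone. This is only a sketch: the stop-word list below is a tiny made-up sample, the function name clean_review is hypothetical, and the spelling, POS tagging, lemmatization, and TextBlob steps are left out because they depend on third-party libraries.

```python
import re

# Tiny illustrative stop-word list (an assumption); the project's actual
# list is not shown in this README.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "this", "to", "of", "i"}


def clean_review(text: str) -> str:
    """Lowercase, drop stop words, then strip digits and non-letter characters."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    # Replace anything that is not a lowercase letter or whitespace with a space.
    letters_only = re.sub(r"[^a-z\s]", " ", " ".join(words))
    # Collapse the repeated whitespace left behind by the substitution.
    return " ".join(letters_only.split())


print(clean_review("This dress is GREAT, I bought 2 of them!!"))
# dress great bought them
```

Because stop words are removed before punctuation is stripped, a stop word with punctuation attached (for example "the,") would pass through this sketch; running the regex step first would avoid that.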

Handling Data Imbalance

Oversampling methods such as SMOTE can be used to handle data imbalance during model training.
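
To show the basic idea of oversampling, here is a sketch of plain random oversampling using only the standard library. It is a simpler stand-in and not SMOTE itself: SMOTE synthesizes new minority samples by interpolating between neighbors, whereas this version only duplicates existing rows. The function name and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All rows belonging to this class are candidates for duplication.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy, made-up data: class 1 dominates, classes 0 and 2 are rare.
X = [[0.1], [0.2], [0.3], [0.4], [0.9], [1.5]]
y = [1, 1, 1, 1, 0, 2]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Whichever method is used, resampling is normally applied to the training split only, so that evaluation still reflects the original class distribution.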

Vectorizer

TF-IDF Vectorizer: Applied with the following parameters:
- stop_words='english': Automatically remove common English stop words (e.g., "and", "the", "is").
- min_df=0.008: A term must appear in at least 0.8% of the documents to be considered.
- ngram_range=(1,3): Create unigrams (single words), bigrams (two-word combinations), and trigrams (three-word combinations).
- token_pattern="\\b[a-z][a-z][a-z]+\\b": Keep only tokens made of three or more lowercase letters.
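
The parameter names above match scikit-learn's TfidfVectorizer. The sketch below does not call that library; it only uses the standard library to show which tokens the pattern keeps and which n-grams a range of (1, 3) produces. The helper names tokens and ngrams are hypothetical.

```python
import re

# Same pattern as above, written as a raw string.
TOKEN_PATTERN = r"\b[a-z][a-z][a-z]+\b"


def tokens(text: str) -> list[str]:
    """Return lowercase tokens of three or more letters."""
    return re.findall(TOKEN_PATTERN, text.lower())


def ngrams(words: list[str], n_min: int = 1, n_max: int = 3) -> list[str]:
    """Enumerate all n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(words[i : i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    ]


words = tokens("Fits well, but it runs small")
print(words)               # ['fits', 'well', 'but', 'runs', 'small']
print(len(ngrams(words)))  # 12 (5 unigrams + 4 bigrams + 3 trigrams)
```

For a sense of scale on min_df=0.008: in a corpus of 10,000 reviews, a term would have to appear in at least 80 of them to be kept.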

Feature Engineering

Topic Modeling: Product titles are classified into three different topics using topic modeling.
Review Length: The length of the review is included as a feature.
Dummy Variables: Dummy variables are created for string columns.

Model Training

Use GridSearchCV to tune hyperparameters during model training.
Evaluate different models including XGBClassifier, Random Forest, and Naive Bayes.

Model Evaluation

Results

The best model is selected based on accuracy and f1-score.
Model comparison:

Model	Accuracy	f1
XGBClassifier	0.80	0.50
XGBClassifier_SMOTE	0.91	0.53
Random Forest	0.79	0.44
Naive Bayes	0.26	0.19
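
Accuracy and F1 can diverge sharply when classes are imbalanced, which is why both are reported. The sketch below computes the two metrics from scratch on made-up labels. The table does not state how F1 was averaged across the three classes, so macro averaging is an assumption here, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, an assumption)."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up labels: a classifier that always predicts the majority class.
y_true = [1] * 8 + [0, 2]
y_pred = [1] * 10
print(accuracy(y_true, y_pred))            # 0.8
print(round(macro_f1(y_true, y_pred), 3))  # 0.296
```

In this toy case the classifier never identifies a negative or neutral review, yet its accuracy is still 0.8; the macro F1 of about 0.30 exposes the problem.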

Streamlit Application

The model is deployed using a Streamlit web application, which can be accessed here.

Usage

To run the Streamlit app locally:

streamlit run app3.py

belliogluyasemin/st-nlp