Amazon Review Sentiment Analysis Web Application
Rui Wang and Xiaojuan Tian, 2017-2018
A Flask web app that uses an API service to predict whether a product review is positive or negative.
Dataset
Amazon product data was provided by Julian McAuley, UCSD.
We cover sentiment analysis for 12 categories of Amazon products. The table below lists our results on the test datasets using NBSVM (Naive Bayes - Support Vector Machine), inspired by a Kaggle kernel.
NBSVM was introduced by Sida Wang and Chris Manning in the paper Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Here we use sklearn's logistic regression, rather than SVM, although in practice the two are nearly identical (sklearn uses the liblinear library behind the scenes).
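
As a rough illustration of the Naive Bayes side of NBSVM, the sketch below computes the log-count ratio from the Wang and Manning paper on binarized unigram features, in plain Python. The function name `log_count_ratio`, the toy documents, and the smoothing value `alpha=1.0` are illustrative assumptions, not this project's code.

```python
import math
from collections import Counter


def log_count_ratio(docs, labels, alpha=1.0):
    """Return {token: r} with r = log((p / |p|_1) / (q / |q|_1)).

    p and q are the smoothed per-token document counts for the
    positive (label 1) and negative (any other label) classes.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    pos, neg = Counter(), Counter()
    for doc, label in zip(docs, labels):
        target = pos if label == 1 else neg
        target.update(set(doc))  # binarized: count a token once per document
    p = {t: alpha + pos[t] for t in vocab}
    q = {t: alpha + neg[t] for t in vocab}
    p_norm, q_norm = sum(p.values()), sum(q.values())
    return {t: math.log((p[t] / p_norm) / (q[t] / q_norm)) for t in vocab}


# Toy, already-tokenized reviews (hypothetical data).
docs = [["great", "product"], ["love", "it"], ["bad", "product"], ["terrible", "waste"]]
labels = [1, 1, 0, 0]
ratios = log_count_ratio(docs, labels)
for token, r in ratios.items():
    print(f"{token}: {r:+.3f}")
```

Tokens seen mostly in positive reviews get a positive ratio, tokens seen mostly in negative reviews get a negative one, and tokens seen equally often in both land at zero. In NBSVM, each feature is scaled by its ratio before a linear classifier (here, logistic regression) is fitted on the scaled features.
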
Category | Precision (%) |
---|---|
Automotive | 89.29 |
Baby | 90.28 |
Clothing Shoes and Jewelry | 90.50 |
Digital Music | 88.92 |
Electronics | 91.58 |
Grocery and Gourmet Food | 90.33 |
Home and Kitchen | 92.05 |
Kindle Store | 92.57 |
Pet Supplies | 89.32 |
Sports and Outdoors | 91.19 |
Toys and Games | 91.24 |
Video Games | 88.93 |
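
For readers unfamiliar with the metric, precision is the share of reviews predicted positive that really are positive. The sketch below shows the arithmetic on made-up labels; the function name `precision_percent` and the sample predictions are hypothetical and are not taken from the project's evaluation code.

```python
def precision_percent(y_true, y_pred, positive=1):
    """Return precision as a percentage: 100 * TP / (TP + FP)."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if p == positive and t == positive)
    false_pos = sum(1 for t, p in pairs if p == positive and t != positive)
    if true_pos + false_pos == 0:
        return 0.0  # no positive predictions at all
    return 100.0 * true_pos / (true_pos + false_pos)


# Hypothetical labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1]
score = precision_percent(y_true, y_pred)
print(f"{score:.2f}")
```
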
Installation
To install the Python packages for the project, clone the repository and run:
conda env create -f environment.yml
from inside the cloned directory. This assumes that Anaconda Python is installed.
# Build the model
python build_model.py
# Run the App
python run.py
Test App
Open a browser at http://localhost:5000.
Choose the category of your purchased product, enter your own review, and submit it to see the predicted sentiment.
Data ETL
The format is one review per line in (loose) JSON. See the examples below for further help reading the data.
Sample review:
{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
where
- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- helpful - helpfulness rating of the review, e.g. 2/3
- reviewText - text of the review
- overall - rating of the product
- summary - summary of the review
- unixReviewTime - time of the review (unix time)
- reviewTime - time of the review (raw)
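
To make the field descriptions concrete, this sketch derives a helpfulness fraction and a calendar date from values in the sample review. The helper name `describe_review` is hypothetical, and reading `helpful` as [helpful votes, total votes] is an assumption based on the `2/3` example above.

```python
from datetime import datetime, timezone


def describe_review(review):
    """Return (helpfulness fraction or None, review date in UTC)."""
    helpful_votes, total_votes = review["helpful"]
    # Avoid division by zero for reviews that nobody has voted on.
    ratio = helpful_votes / total_votes if total_votes else None
    review_date = datetime.fromtimestamp(
        review["unixReviewTime"], tz=timezone.utc
    ).date()
    return ratio, review_date


# Fields copied from the sample review.
sample = {"helpful": [2, 3], "unixReviewTime": 1252800000, "overall": 5.0}
ratio, review_date = describe_review(sample)
print(ratio, review_date.isoformat())
```

The resulting date, 2009-09-13, agrees with the raw `reviewTime` value "09 13, 2009" in the sample.
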
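The following code reads the data into a pandas DataFrame, as suggested by Julian McAuley:
import gzip

import pandas as pd


def parse(path):
    # Each line holds one review in loose JSON (not guaranteed to be strict JSON).
    # eval() executes the line as Python, so only use it on trusted files.
    with gzip.open(path, 'rb') as g:
        for line in g:
            yield eval(line)


def getDF(path):
    # Collect the parsed reviews keyed by row index, then build the DataFrame.
    df = {}
    for i, d in enumerate(parse(path)):
        df[i] = d
    return pd.DataFrame.from_dict(df, orient='index')


df = getDF('reviews_Video_Games.json.gz')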
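
Because the loader above calls `eval` on every line, it will execute whatever the file contains. A possible safer variant, sketched here, swaps in `ast.literal_eval`, which only accepts Python literals. This assumes each line is a Python-literal dict; lines using strict-JSON tokens such as `true` or `null` would need `json.loads` instead. The name `parse_safe` and the temporary sample file are hypothetical.

```python
import ast
import gzip
import os
import tempfile


def parse_safe(path):
    """Yield one review dict per line without executing arbitrary code."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield ast.literal_eval(line)


# Write a tiny one-line sample file so the sketch runs on its own.
sample_line = (
    "{'reviewerID': 'A2SUAM1J3GNN3B', 'asin': '0000013714', "
    "'helpful': [2, 3], 'overall': 5.0}\n"
)
tmp_path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as handle:
    handle.write(sample_line)

reviews = list(parse_safe(tmp_path))
print(reviews[0]["asin"], reviews[0]["overall"])
```

The resulting list of dicts can be passed to `pd.DataFrame(reviews)` in place of the index-keyed dictionary.
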
Reference
[1] Wang, Sida, and Christopher D. Manning. "Baselines and bigrams: Simple, good sentiment and topic classification." Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2. Association for Computational Linguistics, 2012.