Amazon Review Sentiment Analysis Web Application

Rui Wang and Xiaojuan Tian, 2017-2018

Flask web app that uses API service to predict whether the product review is positive or negative.

Dataset

Amazon product data was provided by Julian McAuley, UCSD.

We cover the setiment analysis of 12 catergories of Amazon prodcuts. We list our results in test datasets using NBSVM (Naive Bayes - Support Vector Machine) inspired by a Kaggle kernel.

NBSVM was introduced by Sida Wang and Chris Manning in the paper Baselines and Bigrams: Simple, Good Sentiment and Topic Classiﬁcation. Here we use sklearn's logistic regression, rather than SVM, although in practice the two are nearly identical (sklearn uses the liblinear library behind the scenes).

Category	Precision (%)
Automotive	89.29
Baby	90.28
Clothing Shoes and Jewelry	90.50
Digital Music	88.92
Electronics	91.58
Grocery and Gourmet Food	90.33
Home and Kitchen	92.05
Kindle Store	92.57
Pet Supplies	89.32
Sports and Outdoors	91.19
Toys and Games	91.24
Video games	88.93

Installation

To install the Python packages for the project, clone the repository and run:

conda env create -f environment.yml

from inside the cloned directory. This assumes that Anaconda Python is installed.

# Build the model 
python build_model.py

# Run the App
python run.py

Test App

Open Browser: http://localhost:5000.

Choose the category of your purchased product, fill in your own reviews and get results like the following:

Data ETL

Format is one-review-per-line in (loose) json. See examples below for further help reading the data.

Sample review:

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano.  He is having a wonderful time playing these old hymns.  The music  is at times hard to read because we think the book was published for singing from more than playing from.  Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}

where

reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
asin - ID of the product, e.g. 0000013714
reviewerName - name of the reviewer
helpful - helpfulness rating of the review, e.g. 2/3
reviewText - text of the review
overall - rating of the product
summary - summary of the review
unixReviewTime - time of the review (unix time)
reviewTime - time of the review (raw)

Here are the codes to read the data into a pandas data frame as Julian McAuley indicates:

import pandas as pd
import gzip

def parse(path):
  g = gzip.open(path, 'rb')
  for l in g:
    yield eval(l)

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

df = getDF('reviews_Video_Games.json.gz')

Reference

[1] Wang, Sida, and Christopher D. Manning. "Baselines and bigrams: Simple, good sentiment and topic classification." Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2. Association for Computational Linguistics, 2012.