Amazon_Vine_Analysis

Analyzes book review data from Amazon and the Amazon Vine program using PySpark and Amazon Web Services Relational Database Service (AWS RDS).

Primary language: Jupyter Notebook. License: MIT.

Overview

Amazon and Vine collaborated to create a paid subscription program for readers. To focus and refine marketing efforts for the program, we analyze all 114,309 book reviews (5,012 Vine and 109,297 non-Vine) to determine whether reviews from paid members are positively biased compared with non-member reviews. We use Apache Spark and PySpark in the cloud to run the extract, transform, and load (ETL) process. We then load the data into a PostgreSQL database hosted on an Amazon Web Services Relational Database Service (AWS RDS) instance and connect to it with pgAdmin.
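The extract and transform steps can be illustrated with a minimal sketch. It uses Python's standard `csv` module as a stand-in for PySpark so that it runs without a Spark session, and it is not the notebook's actual code. The column names `review_id`, `star_rating`, and `vine`, the `Y`/`N` flag values, and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Made-up sample rows in the tab-separated layout assumed for the dataset.
SAMPLE_TSV = (
    "review_id\tstar_rating\tvine\n"
    "R1\t5\tY\n"
    "R2\t3\tN\n"
    "R3\t5\tN\n"
    "R4\t4\tY\n"
)


def load_reviews(handle):
    """Parse TSV rows into dicts with an integer rating and a boolean Vine flag."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {
            "review_id": row["review_id"],
            "star_rating": int(row["star_rating"]),
            "is_vine": row["vine"] == "Y",  # assumed flag values: "Y" or "N"
        }
        for row in reader
    ]


def split_by_vine(reviews):
    """Separate Vine (paid) reviews from non-Vine (unpaid) reviews."""
    vine = [r for r in reviews if r["is_vine"]]
    non_vine = [r for r in reviews if not r["is_vine"]]
    return vine, non_vine


reviews = load_reviews(io.StringIO(SAMPLE_TSV))
vine_reviews, non_vine_reviews = split_by_vine(reviews)
print(len(vine_reviews), len(non_vine_reviews))
```

In PySpark the same split would typically be a pair of DataFrame filters on the Vine flag column. The list comprehensions here only mirror that logic on a small in-memory sample.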


Results

  • There were 5,012 Vine reviews;
  • There were 109,297 non-Vine reviews.

[Images: paid_reviews, unpaid_reviews]

  • 2,031 Vine reviews were five stars;
  • 49,967 non-Vine reviews were five stars.

[Images: paid_5_stars, unpaid_5_stars]

  • Approximately 40.52% of Vine reviews were five stars;
  • Approximately 45.72% of non-Vine reviews were five stars.

[Image: percentages]

At-a-Glance

| Metric | Vine Reviews | Non-Vine Reviews |
| --- | --- | --- |
| Total Reviews | 5,012 | 109,297 |
| Number of Five Stars | 2,031 | 49,967 |
| Percentage of Five Stars | 40.52% | 45.72% |
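The percentages can be reproduced directly from the reported counts. The following is a minimal sketch of that arithmetic; the function name `five_star_percentage` is a hypothetical helper, not taken from the project notebook.

```python
def five_star_percentage(five_star_count, total_count):
    """Return the share of five-star reviews as a percentage."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return 100.0 * five_star_count / total_count


# Counts as reported in the Results section.
vine_pct = five_star_percentage(2031, 5012)
non_vine_pct = five_star_percentage(49967, 109297)

print(f"Vine: {vine_pct:.2f}%")          # Vine: 40.52%
print(f"Non-Vine: {non_vine_pct:.2f}%")  # Non-Vine: 45.72%
print(f"Difference: {non_vine_pct - vine_pct:.2f} percentage points")
```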

Summary

Based on the calculations above, positivity bias from members of the Vine program is unlikely. The share of five-star Vine reviews (40.52%) was about 5 percentage points lower than the share of five-star non-Vine reviews (45.72%), so paid reviewers were not more likely to award the top rating. Additional analysis could compare the full distribution of star ratings by calculating the percentage of Vine and non-Vine reviews at each star rating.
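The suggested follow-up analysis could look something like the sketch below, which computes the percentage of reviews at each star rating for two groups. The rating lists are made-up sample data, not results from the dataset, and `rating_distribution` is a hypothetical helper name.

```python
from collections import Counter


def rating_distribution(star_ratings):
    """Return {star: percentage of reviews} for stars 1 through 5."""
    counts = Counter(star_ratings)
    total = sum(counts[star] for star in range(1, 6))
    if total == 0:
        raise ValueError("no ratings between 1 and 5 were supplied")
    return {star: 100.0 * counts[star] / total for star in range(1, 6)}


# Made-up sample ratings, for illustration only.
vine_ratings = [5, 4, 4, 3, 5, 2, 4, 5, 1, 4]
non_vine_ratings = [5, 5, 1, 5, 4, 2, 5, 3, 5, 1]

for label, ratings in (("Vine", vine_ratings), ("Non-Vine", non_vine_ratings)):
    distribution = rating_distribution(ratings)
    formatted = ", ".join(f"{star} stars: {pct:.1f}%" for star, pct in distribution.items())
    print(f"{label}: {formatted}")
```

Comparing the two distributions side by side would show whether Vine reviews cluster at four stars instead of five, or are shifted lower overall, which a single five-star percentage cannot reveal.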


Resources

Data Source:

https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_00.tsv.gz

Software:

AWS RDS
Google Colaboratory Notebook
Apache Spark
PySpark
Python
PostgreSQL
pgAdmin
Hadoop
MapReduce
mrjob

Contact

Email: show.wang94@gmail.com

LinkedIn: showkatewang