/Big_Data_Marketing_Analysis-AWS-Spark-SQL

Build Data Pipeline with pgAdmin, AWS Cloud and Apache Spark to Analyze and Determine Bias in Amazon Vine Reviews

Primary language: Jupyter Notebook

Marketing Analysis with Big Data

Goals  •  Dataset  •  Tools Used  •  Analysis and Challenges  •  Results  •  Summary

Goals

Amazon Vine is a paid program that allows manufacturers and publishers to receive reviews for their products: companies pay a small fee to Amazon and provide products to Amazon Vine members, who are then required to publish a review. This project analyzes Amazon reviews written by Vine members. Approximately 50 datasets are available, each containing reviews for a specific product category, from clothing apparel to wireless products.

This analysis covers the TV review dataset. First, I use PySpark to perform the ETL process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into the database managed through pgAdmin. Next, I use PySpark to determine whether reviews from Vine members in this dataset are biased toward favorable ratings.

Dataset

An Amazon S3 bucket containing approximately 50 review datasets.

Tools Used

  • Apache Spark: A unified analytics engine for large-scale data processing
  • Google Colab: Cloud-based developer notebooks, used for testing scripts and performing complex calculations
  • Amazon Web Services: Cloud-based services that perform many functions, such as hosting and data processing
    • AWS RDS: Relational database service used for querying data in the cloud
    • AWS S3: Cloud file storage service
  • pgAdmin: Software used to build databases and analyze data with SQL

Analysis and Challenges

After the success of the SellBy project, our group is analyzing Amazon reviews written by members of the paid Amazon Vine program. I analyzed the TV review dataset, using PySpark to perform the ETL process: extracting the dataset, transforming the data, connecting to an AWS RDS instance, and loading the transformed data into the database managed through pgAdmin. I then used PySpark to determine whether there is any bias toward favorable reviews from Vine members in this dataset.
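As a rough illustration of the extract and load steps described above, here is a minimal PySpark sketch. It is not the notebook's actual code. The file format (tab-separated with a header row), the use of PostgreSQL on RDS (inferred from pgAdmin), and every name in it (`jdbc_url`, `extract_reviews`, `load_table`, and their parameters) are assumptions made for illustration.

```python
def jdbc_url(host: str, database: str, port: int = 5432) -> str:
    """Build a PostgreSQL JDBC URL for an RDS endpoint.

    5432 is the PostgreSQL default port; host and database are placeholders
    for whatever the RDS instance actually uses.
    """
    return f"jdbc:postgresql://{host}:{port}/{database}"


def extract_reviews(spark, path: str):
    """Extract: read a review file into a Spark DataFrame.

    Assumes a tab-separated file with a header row. `spark` is an existing
    SparkSession and `path` is wherever the dataset was copied to.
    """
    return spark.read.csv(path, sep="\t", header=True, inferSchema=True)


def load_table(df, url: str, table: str, user: str, password: str) -> None:
    """Load: append a DataFrame to a table in the RDS database over JDBC."""
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",  # assumed PostgreSQL JDBC driver
    }
    df.write.jdbc(url=url, table=table, mode="append", properties=properties)
```

To use it, create a session with `SparkSession.builder.getOrCreate()`, make sure the PostgreSQL JDBC driver is on Spark's classpath, and call `extract_reviews` followed by `load_table` for each transformed table.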

Below are the DataFrames I used to analyze the TV review data.

Review Data

Review ID Table

Customer Table

Product Table

Vine Table
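The transform step that produces these tables can be sketched as selecting column subsets from the full review DataFrame. The column names below are assumed from the public Amazon review dataset schema, the assignment of columns to tables is a guess, and `TABLE_COLUMNS` and `split_tables` are hypothetical names. Any deduplication or aggregation the real tables need is left out.

```python
# Assumed mapping from table name to the review columns it keeps.
TABLE_COLUMNS = {
    "review_id_table": ["review_id", "customer_id", "product_id", "product_parent", "review_date"],
    "customers": ["customer_id"],
    "products": ["product_id", "product_title"],
    "vine_table": ["review_id", "star_rating", "helpful_votes", "total_votes", "vine", "verified_purchase"],
}


def split_tables(reviews):
    """Return one DataFrame per table by selecting its columns from `reviews`.

    `reviews` is the full review DataFrame; only its `select` method is used.
    """
    return {name: reviews.select(*columns) for name, columns in TABLE_COLUMNS.items()}
```

Each resulting DataFrame can then be written to its matching database table.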

Results

Vine Reviews

Unpaid Reviews

  • In total, there were 255 Vine reviews and 22,675 unpaid reviews
  • Of the 255 Vine reviews, 103 were 5-star reviews (40%)
  • Of the 22,675 unpaid reviews, 10,310 were 5-star reviews (45%)
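The percentages above follow directly from the counts. A short check in plain Python, where `five_star_share` is a hypothetical helper name:

```python
def five_star_share(five_star: int, total: int) -> float:
    """Return the percentage of reviews that are 5-star."""
    return 100.0 * five_star / total


vine_share = five_star_share(103, 255)          # Vine reviews
unpaid_share = five_star_share(10_310, 22_675)  # unpaid reviews

print(f"Vine:   {vine_share:.1f}%")    # Vine:   40.4%
print(f"Unpaid: {unpaid_share:.1f}%")  # Unpaid: 45.5%
print(f"Gap:    {unpaid_share - vine_share:.1f} points")  # Gap:    5.1 points
```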

Summary

Based on my analysis comparing Vine and unpaid reviews, I did not see evidence of positivity bias in the paid reviews: a higher percentage of unpaid reviews (45%) than Vine reviews (40%) were 5 stars.

Here are some additional analyses I plan to apply to the current dataset:

  • Compare the number of 1-star reviews between Vine and unpaid reviews to identify any additional patterns
  • Filter the Vine and unpaid review datasets by verified purchase to add credibility to the review sample analysis
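Both follow-up analyses could be expressed as one Spark SQL query. This is a sketch rather than code from the notebook: `star_share_query` is a hypothetical helper, and the column names (`vine`, `star_rating`, `verified_purchase`) and the `'Y'` flag value are assumed from the public Amazon review dataset schema.

```python
def star_share_query(view: str, stars: int, verified_only: bool = False) -> str:
    """Build a SQL query that reports, per Vine flag, the share of reviews
    with the given star rating.

    `view` is interpolated directly into the SQL, so pass only trusted names.
    """
    matching = f"SUM(CASE WHEN star_rating = {int(stars)} THEN 1 ELSE 0 END)"
    parts = [
        "SELECT vine,",
        "COUNT(*) AS total_reviews,",
        f"{matching} AS matching_reviews,",
        f"ROUND(100.0 * {matching} / COUNT(*), 1) AS pct_matching",
        f"FROM {view}",
    ]
    if verified_only:
        # Assumed flag value: 'Y' marks a verified purchase.
        parts.append("WHERE verified_purchase = 'Y'")
    parts.append("GROUP BY vine")
    return " ".join(parts)
```

With the review data registered as a temporary view, for example `reviews.createOrReplaceTempView("vine_table")`, the 1-star comparison restricted to verified purchases would be `spark.sql(star_share_query("vine_table", 1, verified_only=True)).show()`.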
