This is a Big Data project. It performs ETL (extract, transform, load) on Amazon product reviews and checks them for bias.
This challenge includes analyzing Amazon reviews written by members of the paid Amazon Vine program. The Amazon Vine program is a service that allows manufacturers and publishers to receive reviews for their products. Companies like SellBy pay a small fee to Amazon and provide products to Amazon Vine members, who are then required to publish a review.
Our dataset includes reviews of furniture. We used PySpark to perform the ETL process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into PostgreSQL tables managed through pgAdmin. Next, we used PySpark to determine whether there is any bias toward favorable reviews from Vine members in our dataset.
- Data Source: Amazon Review Datasets, Furniture Dataset
- Tools: AWS, Google Colab Notebook, PySpark, PostgreSQL 12.9, pgAdmin 4
Using the cloud ETL process, I created an AWS RDS database with tables in pgAdmin, selected a dataset of furniture reviews from the Amazon Review datasets (Furniture Dataset), and extracted the dataset into a DataFrame. The DataFrame was then transformed into four separate DataFrames that match the table schema in pgAdmin, and the transformed data was uploaded into the appropriate tables. All steps are listed below:
- Create a new database with Amazon RDS.
- In pgAdmin, create a new database in the Amazon RDS server
- In pgAdmin, run a new query to create the tables for our new database. Now four new tables are created: customers_table, products_table, review_id_table, and vine_table
- Start a new Google Colab Notebook called 'Amazon_Reviews_ETL'. To use PySpark, install Spark and Java, set the environment variables, and start a Spark session.
- Extract the 'furniture dataset' and create a new dataframe.
- Transform the dataset into four DataFrames that will match the schema in the pgAdmin tables.
- Create customers_table dataframe
- Create products_table dataframe
- Create review_id_table dataframe
- Create vine_table dataframe
- Load the dataframes to corresponding tables in pgAdmin
- Queries are run to check that the tables have been populated
- The 'Amazon_Reviews_ETL' Google Colab Notebook was exported as an ipynb file and is available for review here: Amazon_Reviews_ETL.ipynb
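
The load step above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper names `jdbc_settings` and `load_table`, the placeholder endpoint, database name, and credentials are all hypothetical, and it assumes the PostgreSQL JDBC driver is available to the Spark session.

```python
def jdbc_settings(endpoint, database, user, password, port=5432):
    """Build the JDBC URL and connection properties for a PostgreSQL RDS instance.

    All argument values are placeholders supplied by the caller (assumption).
    """
    url = f"jdbc:postgresql://{endpoint}:{port}/{database}"
    props = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    return url, props


def load_table(df, table, url, props, mode="append"):
    """Write a Spark DataFrame into an existing PostgreSQL table over JDBC."""
    df.write.jdbc(url=url, table=table, mode=mode, properties=props)


# Hypothetical placeholder values; replace them with your own RDS details.
url, props = jdbc_settings(
    "<your-rds-endpoint>", "<your-database>", "postgres", "<password>"
)
```

With the four transformed DataFrames in hand, each one would be passed to `load_table` with its matching table name, for example `load_table(vine_df, "vine_table", url, props)`, where `vine_df` stands for whatever the notebook named that DataFrame.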
In this section I worked on determining if there is bias towards reviews that were written as part of the Vine program. This task was completed by using PySpark.
- A Google Colab notebook called Vine_Review_Analysis was created and used to extract the furniture dataset
- Vine_table (from section 1) was recreated
- The data in Vine_table is filtered to create a DataFrame where there are 20 or more total votes
- The data is filtered again to create a DataFrame where the percentage of 'helpful_votes' is equal to or greater than 50%
- The data is filtered to create a DataFrame where there is a Vine review (paid reviews)
- The data is filtered to create a DataFrame where there isn’t a Vine review (unpaid reviews)
- The total number of reviews, the number of 5-star reviews, and the percentage of 5-star reviews are calculated for all Vine and non-Vine reviews
- 'Vine_Review_Analysis' Google Colab Notebook was exported as an ipynb file, and is available for review here: Vine_Review_Analysis.ipynb
This analysis reveals the following:
- There are 136 Vine reviews as compared to 18,019 non-Vine reviews.
- There are 74 Vine reviews that were 5-star as compared to 8,482 non-Vine 5-star reviews.
- 54.41% of Vine reviews are 5-star as compared to 47.07% of non-Vine reviews.
- Since the 5-star percentage for Vine reviews (54.41%) is about 7 percentage points higher than for non-Vine reviews (47.07%), this analysis suggests a slight bias toward favorable reviews from Vine members.
- I recommend performing the same analysis on 4-star ratings to see whether there is a similar pattern, because 4-star ratings contribute to positive reviews as well.
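
The percentages reported above follow directly from the review counts. As a quick check, the arithmetic can be reproduced in plain Python; the helper name `five_star_pct` is hypothetical and introduced only for this illustration.

```python
def five_star_pct(five_star, total):
    """Percentage of reviews that are 5-star, rounded to two decimals."""
    return round(100 * five_star / total, 2)


# Counts taken from the results listed above.
vine_pct = five_star_pct(74, 136)          # Vine (paid) reviews
non_vine_pct = five_star_pct(8482, 18019)  # non-Vine (unpaid) reviews

# Difference in percentage points between the two groups.
gap = round(vine_pct - non_vine_pct, 2)

print(f"Vine: {vine_pct}%  non-Vine: {non_vine_pct}%  gap: {gap} points")
```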