Use PySpark to perform the ETL process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into a PostgreSQL database managed through pgAdmin.
NOTE: The data has been filtered by dropping all null values and keeping only products with a minimum of 20 votes and a helpful rating over 50 percent.
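The filter described in the note can be sketched roughly as follows. It is written in plain Python rather than PySpark so it runs without a Spark session. The field names `review_id`, `total_votes`, and `helpful_votes`, the helper `filter_reviews`, and the sample rows are all assumptions made for illustration; they are not taken from the dataset.

```python
def filter_reviews(rows, min_votes=20, min_helpful_ratio=0.5):
    """Keep rows with no null values, at least `min_votes` total votes,
    and a helpful-vote ratio strictly above `min_helpful_ratio`."""
    kept = []
    for row in rows:
        # Drop any row that contains a null value.
        if any(value is None for value in row.values()):
            continue
        # Require a minimum number of votes (this also avoids dividing by zero
        # as long as min_votes is positive).
        if row["total_votes"] < min_votes:
            continue
        # Require the helpful ratio to be over the threshold, not equal to it.
        if row["helpful_votes"] / row["total_votes"] <= min_helpful_ratio:
            continue
        kept.append(row)
    return kept


# Hypothetical sample rows, made up for this sketch.
sample_rows = [
    {"review_id": "R1", "total_votes": 25, "helpful_votes": 20},    # kept
    {"review_id": "R2", "total_votes": 10, "helpful_votes": 9},     # too few votes
    {"review_id": "R3", "total_votes": 40, "helpful_votes": 20},    # exactly 50 percent
    {"review_id": "R4", "total_votes": 30, "helpful_votes": None},  # null value
    {"review_id": "R5", "total_votes": 20, "helpful_votes": 11},    # kept
]

filtered = filter_reviews(sample_rows)
print([row["review_id"] for row in filtered])
```

Checking for nulls first means the vote comparisons never see a missing value, and "over 50 percent" is read here as a strict inequality.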
- How many Vine reviews and non-Vine reviews were there?
- How many Vine reviews were 5 stars? How many non-Vine reviews were 5 stars?
- What percentage of Vine reviews were 5 stars? What percentage of non-Vine reviews were 5 stars?
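The three questions above come down to a count, a conditional count, and a ratio per group. A minimal sketch in plain Python is below. The field names `vine` and `star_rating`, the `"Y"`/`"N"` flag values, and the helper `summarize` are assumptions for illustration. The Vine rows reproduce the 90 reviews and 44 five-star ratings reported in this write-up; the non-Vine rows are placeholders, not real results.

```python
def summarize(rows):
    """Return total reviews, 5-star reviews, and percent 5-star per Vine flag."""
    summary = {}
    for vine_flag in ("Y", "N"):
        group = [r for r in rows if r["vine"] == vine_flag]
        total = len(group)
        five_star = sum(1 for r in group if r["star_rating"] == 5)
        # Guard against an empty group to avoid dividing by zero.
        pct = 100.0 * five_star / total if total else 0.0
        summary[vine_flag] = {
            "total": total,
            "five_star": five_star,
            "pct_five_star": pct,
        }
    return summary


# Vine rows: 90 reviews, 44 of them 5 stars (figures from this write-up).
vine_rows = [{"vine": "Y", "star_rating": 5}] * 44 + [{"vine": "Y", "star_rating": 3}] * 46
# Non-Vine rows: placeholder values only, NOT the real counts.
non_vine_rows = [{"vine": "N", "star_rating": 5}] * 3 + [{"vine": "N", "star_rating": 2}] * 7

summary = summarize(vine_rows + non_vine_rows)
for flag, stats in summary.items():
    print(flag, stats["total"], stats["five_star"], round(stats["pct_five_star"], 1))
```

With the Vine figures above, 44 of 90 works out to about 48.9 percent.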
Looking at these statistics, it is hard to say whether there is a bias. Of the 90 Vine reviews, 44 were 5-star ratings, a higher percentage than for non-paid reviews. The 10 percent difference does lend credence to the assumption that paid reviews contain bias, but with such a small sample size it may be best not to draw conclusions just yet.
More data is needed, or the filter could be expanded to count both 4- and 5-star ratings.