This project analyzes the Amazon food review dataset, which can be found here:
The dataset and all other resources are downloaded automatically when the project is executed.
- Create a bucket in S3 Storage
- Upload all the project files (only the test files are required)
- Compile the project and upload the jar file to the bucket
- Set up an EMR cluster. The project has been tested with this configuration: emr-5.30.1, m5.xlarge, Spark 2.4.5, Zeppelin 0.8.2
- Connect to the master node through SSH
Example: ssh -i ~/prova.pem hadoop@ec2-34-205-140-244.compute-1.amazonaws.com
- Copy the jar file from the s3 bucket to the master node
Example: aws s3 cp s3://foodreview/FoodReview.jar .
To compile the project and create the jar file:
$ cd FoodReviewFinal
$ sbt assembly

To execute it locally:
$ export environment=local
$ spark-submit FoodReview.jar

To execute it on AWS (default option):
$ export path="path-to-s3-location-goes-here"
$ spark-submit FoodReview.jar
In src/test/scala we provide a set of test files, one for each operation supported on the dataset, plus totalTest.txt, which contains a mix of operations.
When the project runs, the test file loaded by default is totalTest.txt. To load a different file, set the testFileName environment variable as shown below:
$ export testFileName=testRecommendation.txt
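As a rough illustration of how the three environment variables mentioned in this README (environment, path, testFileName) could be read with their defaults, here is a minimal sketch. The RunConfig name, its fields, and the "aws" default value are assumptions made for this example; the project may resolve its settings differently.

```scala
// Hypothetical sketch, not the project's actual configuration code.
// Only the variable names (environment, path, testFileName) and the
// totalTest.txt default come from this README; everything else is assumed.
final case class RunConfig(
  environment: String,   // "local", or the assumed default "aws"
  path: Option[String],  // S3 location, only needed when running on AWS
  testFileName: String   // test file to load
)

object RunConfig {
  // Takes the environment as a Map so it can be exercised without
  // touching the real process environment.
  def fromEnv(env: Map[String, String]): RunConfig =
    RunConfig(
      environment = env.getOrElse("environment", "aws"),
      path = env.get("path"),
      testFileName = env.getOrElse("testFileName", "totalTest.txt")
    )
}
```

In a real run this would be called as RunConfig.fromEnv(sys.env).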
Computes product recommendations for a user, based on the user's previous reviews and ratings compared with those of similar users.
Note: If no userID is given as input, the program picks a random userID from the dataset.
recommend (userID)
Computes the product ranking using the Bayesian mean rating algorithm.
rank
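To give an idea of what a Bayesian mean rating does, the sketch below uses the common formulation (C * m + sum of ratings) / (C + n), where m is the global mean rating and C is a prior weight. The object name, the priorWeight parameter, and the in-memory Map input are assumptions made for this example; the project's Spark implementation and its choice of prior may differ.

```scala
// Hypothetical sketch of a Bayesian mean ranking on plain collections.
// priorWeight (C) is an assumed tuning parameter, not a value from the project.
object BayesianRank {
  // Pulls a product's mean towards the global mean; the fewer ratings
  // a product has, the stronger the pull.
  def bayesianMean(ratings: Seq[Double], globalMean: Double, priorWeight: Double): Double =
    (priorWeight * globalMean + ratings.sum) / (priorWeight + ratings.size)

  // Ranks products by Bayesian mean, best first.
  def rank(byProduct: Map[String, Seq[Double]], priorWeight: Double): Seq[(String, Double)] = {
    val all = byProduct.values.flatten.toSeq
    val globalMean = if (all.isEmpty) 0.0 else all.sum / all.size
    byProduct.toSeq
      .map { case (id, rs) => (id, bayesianMean(rs, globalMean, priorWeight)) }
      .sortBy { case (_, score) => -score }
  }
}
```

With this formulation, a product with a single 5-star review can rank below a product with many 4.5-star reviews, which is the usual motivation for preferring it over a plain average.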
Creates a CSV file describing how the average rating evolves over time, with columns
| productId | year/month | avgRating |
There are two types of analysis: over an interval of years, and month by month within a given year.
The CSV files are stored in the "resources" folder, with the naming schema: P<year/month>_<productID>_YEAR(S)_YYYY(_YYYY).csv
Example: PY_B001BDDTB2_YEARS_2009_2012.csv = Product time analysis, on a yearly basis, for productID B001BDDTB2 between 2009 and 2012
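The naming schema can be sketched as a pair of small helpers. The yearly form reproduces the example above; the monthly form (PM_..._YEAR_YYYY.csv) is an assumption inferred from the schema and has not been checked against the project's output. The object and method names are hypothetical.

```scala
// Hypothetical helpers that build CSV file names following the schema above.
object EvolutionFileName {
  // Yearly analysis over an interval, e.g. PY_B001BDDTB2_YEARS_2009_2012.csv
  def yearly(productId: String, yearBegin: Int, yearEnd: Int): String =
    s"PY_${productId}_YEARS_${yearBegin}_${yearEnd}.csv"

  // Monthly analysis within one year. The "PM" prefix and the singular
  // "YEAR" token are assumptions inferred from the schema.
  def monthly(productId: String, year: Int): String =
    s"PM_${productId}_YEAR_${year}.csv"
}
```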
evolutionY yearBegin yearEnd productID(s)
Note: to omit the begin (end) year, replace it with a '-'. In that case the first (last) year recorded in the dataset is used.
evolutionM year productID(s)
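The '-' placeholder accepted by evolutionY could be resolved along these lines. This is only a sketch: the YearRange object, its method names, and the choice to return None on invalid input are assumptions, not the project's actual argument parsing.

```scala
import scala.util.Try

// Hypothetical sketch of resolving the yearBegin / yearEnd arguments.
object YearRange {
  // "-" falls back to the given dataset bound; anything else must be an integer.
  def resolve(arg: String, fallback: Int): Option[Int] =
    if (arg == "-") Some(fallback) else Try(arg.trim.toInt).toOption

  // Returns the (begin, end) interval, or None when an argument is not a
  // valid year, the interval is reversed, or the dataset has no years.
  def bounds(beginArg: String, endArg: String, datasetYears: Seq[Int]): Option[(Int, Int)] =
    if (datasetYears.isEmpty) None
    else
      for {
        begin <- resolve(beginArg, datasetYears.min)
        end <- resolve(endArg, datasetYears.max)
        if begin <= end
      } yield (begin, end)
}
```

For instance, with a dataset covering 1999 to 2012, the arguments "-" and "2005" would resolve to the interval 1999 to 2005.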
Computes a ranking of users by the helpfulness of their reviews.
Note:
- If a user's helpfulness score is lower than the average (of the other users who gave the same rating to the same product), the score is raised to the mean of the initial score and that average, i.e. (initial score + average score) / 2. The final helpfulness score is the average of the user's helpfulness scores over the evaluated products.
Optional arguments:
- threshold: keep only users whose helpfulness score is greater than this value (default 0)
- limit: maximum number of rows returned (default 20)
helpfulness (userID) (threshold) (limit)
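The adjustment rule and the optional arguments described above can be sketched on plain collections as follows. The Helpfulness object, its method names, and the (score, peerAverage) input shape are assumptions made for this example; how the project derives the raw per-review score is not shown here, and its Spark implementation may differ.

```scala
// Hypothetical sketch of the helpfulness ranking on in-memory data.
object Helpfulness {
  // Scores below the peer average are pulled halfway up to it;
  // scores at or above the average are left unchanged.
  def adjust(score: Double, peerAverage: Double): Double =
    if (score < peerAverage) (score + peerAverage) / 2.0 else score

  // perProduct holds, for one user, a (raw score, peer average) pair
  // for each evaluated product. Returns None for a user with no products.
  def finalScore(perProduct: Seq[(Double, Double)]): Option[Double] =
    if (perProduct.isEmpty) None
    else Some(perProduct.map { case (s, avg) => adjust(s, avg) }.sum / perProduct.size)

  // Keeps users scoring above the threshold, best first, at most `limit` rows.
  def rank(
      users: Map[String, Seq[(Double, Double)]],
      threshold: Double = 0.0,
      limit: Int = 20
  ): Seq[(String, Double)] =
    users.toSeq
      .flatMap { case (id, ps) => finalScore(ps).map(s => (id, s)) }
      .filter { case (_, s) => s > threshold }
      .sortBy { case (_, s) => -s }
      .take(limit)
}
```

For example, a raw score of 0.25 against a peer average of 0.75 becomes 0.5, while a raw score of 0.75 against a peer average of 0.25 stays at 0.75.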
© Gabriele Calarota, Alberto Drusiani.