Amazon review data served through PostgreSQL
Setup
CAUTION: This is only tested wit Book reviews.
- Download dataset from: http://deepyeti.ucsd.edu/jianmo/amazon/index.html
- Put dataset and meta file into ./import_job/datasets directory named:
- data.json.gz
- meta.json.gz
- Run
start.sh
- Run
./import_job/run.sh
- Check if import was successful at
- http://localhost:8080/?pgsql=db&username=postgres&db=dataset
- Username: postgres
- Password: postgres
How to query the data
We can consider each reviewer as a partition in our Federated Learning experiment. To get all partition IDs run:
SELECT DISTINCT data_json.values->>'reviewerID'
FROM data_json;
To get all reviews for a single reviewerID you might run
SELECT *
FROM data_json
WHERE data_json.values->>'reviewerID' = 'AGUJ97UTICV4W';
SQL Functions
To optimize some queries a few materialized views and SQL functions are created.
Get reviews by author ID:
Id range with the book reviews dataset is from 1
to 15362619
.
SELECT * FROM reviews_by_author_id(1);
WARNING: The numerical author id / string reviewerID mapping is not stable and even with the same base dataset you might end up with a different numerical id / reviewerID mapping after recreating the database. They are stable between queries though.
Get authors with min number of reviews:
SELECT * FROM authors_by_min_num_reviews(10);