Amazon Home Products Customer Reviews API

If you sell consumer home products and need an API to support your website, business dashboard, or research, this API is for you!

Here's a link to the dataset's data dictionary, which also includes the S3 links for all the customer review datasets.

Data Dictionary

Requirements

  • Git
  • Docker
  • Docker Compose

Installation

  1. Open your terminal
  2. Clone the repo: git clone git@github.com:Shumakriss/dataset-to-api.git
  3. Change directories: cd dataset-to-api
  4. Launch the Docker containers: docker-compose up --build
  5. Open your browser
  6. Hit the test URL: http://localhost:8081/health
  7. You should see the message "Server is running" (a scripted version of this check is sketched below)
  8. Try the rest of the API!
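If you'd rather script that check, here is a minimal version using only the Python standard library (it assumes the default port 8081 from docker-compose):

# health_check.py - scripted version of steps 6-7 (stdlib only)
from urllib.request import urlopen

with urlopen("http://localhost:8081/health") as resp:
    print(resp.status, resp.read().decode())  # expect: 200 Server is running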

API

See below for the list of endpoints in this API. All responses are formatted as JSON.

List Products

Method: GET
Returns: A paginated list of products, each represented as a list: [product_id, product_parent, product_title, product_category]
Endpoint: /products
Example: http://localhost:8081/products?keyword=Roomba
Example: http://localhost:8081/products?page=0&page-size=100
Query Parameters:

  • keyword: Wildcard match on product titles
  • page-size: Maximum number of products per page, capped at 100
  • page: The zero-based page index to retrieve
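For example, here is a small client for this endpoint, using only the Python standard library. It assumes each item in the JSON response is a [product_id, product_parent, product_title, product_category] list, as described above:

# list_products.py - query /products with keyword and paging (stdlib only)
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://localhost:8081"

def list_products(keyword=None, page=0, page_size=100):
    params = {"page": page, "page-size": page_size}
    if keyword:
        params["keyword"] = keyword
    with urlopen(f"{BASE}/products?{urlencode(params)}") as resp:
        return json.load(resp)

for product_id, product_parent, title, category in list_products(keyword="Roomba"):
    print(product_id, title)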

Reviews by Product

Method: GET
Returns: A list of all the reviews for a specific product ID
Endpoint: /reviews/<product_id>
Example: http://localhost:8081/reviews/B00EE62UAE
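A matching sketch for this endpoint (the exact shape of each review object is an assumption; adjust to whatever the API actually returns):

# product_reviews.py - fetch all reviews for one product (stdlib only)
import json
from urllib.request import urlopen

def reviews_for(product_id):
    with urlopen(f"http://localhost:8081/reviews/{product_id}") as resp:
        return json.load(resp)

reviews = reviews_for("B00EE62UAE")
print(f"{len(reviews)} reviews retrieved")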

Health

Method: GET
Returns: 200 when the server is available
Endpoint: /health
Example: http://localhost:8081/health

Internal Details

This project was built to serve a useful API based on customer review data.

Some notable technical decisions:

  • Use of Docker and Docker Compose for reproducibility
  • Use of Python for rapid application development
  • Use of batch-style processing given a static dataset rather than a live source

Notable Python libraries include:

  • Boto3 - For Amazon S3
  • psycopg2 - For Postgres
  • Flask - For web services
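As a rough illustration of how these libraries fit together in the processor, here is a hedged sketch (not the project's actual code; the bucket, key, column list, and connection settings are all placeholders):

# processor_sketch.py - illustrative batch load: S3 TSV -> Postgres
import csv
import io

import boto3
import psycopg2

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="example-bucket", Key="reviews.tsv")  # placeholders
rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")),
                      delimiter="\t")

conn = psycopg2.connect(host="localhost", dbname="reviews",
                        user="postgres", password="postgres")  # dev defaults only
with conn, conn.cursor() as cur:  # commits on success, rolls back on error
    for row in rows:
        cur.execute(
            "INSERT INTO reviews (review_id, product_id, star_rating) "
            "VALUES (%s, %s, %s)",
            (row["review_id"], row["product_id"], row["star_rating"]),
        )
conn.close()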

Considerations

Maintainability

The project is structured with a docker-compose.yml and two subprojects, processor and web each with its own Dockerfile. Usage of infrastructure as code serves to improve reproducibility and to make deployment less dependent on environment.

User Experience

Processing is done as quickly as possible to limit user waiting while the application loads. Reviews are committed periodically so that they can be queried while the data is still loading. The API should be well-documented and intuitive.
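One way to implement the periodic commits (a sketch; the batch size, table, and row source are illustrative):

# periodic_commit_sketch.py - commit in batches so partial data is queryable
import psycopg2

def row_source():
    # Placeholder for the TSV iterator shown in the processor sketch above.
    return iter([])

BATCH = 1000  # illustrative batch size
conn = psycopg2.connect(host="localhost", dbname="reviews",
                        user="postgres", password="postgres")
cur = conn.cursor()
for i, row in enumerate(row_source(), start=1):
    cur.execute("INSERT INTO reviews (review_id) VALUES (%s)",
                (row["review_id"],))
    if i % BATCH == 0:
        conn.commit()  # rows loaded so far become visible to API queries
conn.commit()          # flush the final partial batch
conn.close()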

Security (API key management)

Secrets may be provided via docker-compose and are not hardcoded (except certain development defaults, which should not be used in production). The right approach varies by organization: you may have Docker secrets or another secrets-management system. Authentication for the API is another major point of variation, since it depends on your API token management system and user directories.
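In practice this means the code reads environment variables that docker-compose injects, falling back to development defaults. A sketch (the variable names here are assumptions, not necessarily the ones this project uses):

# settings_sketch.py - secrets come from the environment, not source code
import os

DB_HOST = os.environ.get("DB_HOST", "localhost")
DB_NAME = os.environ.get("DB_NAME", "reviews")
DB_USER = os.environ.get("DB_USER", "postgres")
# Fallback is for local development only; always set this in production.
DB_PASSWORD = os.environ.get("DB_PASSWORD", "postgres")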

Documentation (End user and internal)

This guide should provide both API documentation and architectural insights.

Error handling

The code is generally organized to fail quickly. For example, a database health check is performed before major data operations to avoid wasting the user's time.
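A fail-fast check can be as simple as the following sketch (connection settings are placeholders):

# db_health_sketch.py - verify the database is reachable before heavy work
import sys

import psycopg2

def database_healthy():
    try:
        conn = psycopg2.connect(host="localhost", dbname="reviews",
                                user="postgres", password="postgres",
                                connect_timeout=3)
        with conn, conn.cursor() as cur:
            cur.execute("SELECT 1")
        conn.close()
        return True
    except psycopg2.OperationalError:
        return False

if not database_healthy():
    sys.exit("Database unavailable; aborting before any data processing.")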

Testability

Some tests are included, along with instructions for running the scripts manually outside of Docker and for attaching a database tool if necessary. The health check endpoint and ordinary API usage also serve as manual test points.
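For example, a minimal smoke test that pytest can run against a live stack (a sketch; it assumes the containers are already up on port 8081):

# test_health.py - smoke test; run after docker-compose up
from urllib.request import urlopen

def test_health_endpoint():
    with urlopen("http://localhost:8081/health") as resp:
        assert resp.status == 200
        assert b"Server is running" in resp.read()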

Processing approach (streaming, batch processing, etc.)

Batch-style processing is used due to the static nature of the dataset. A live, continuous data source would be better suited to a streaming approach.

Development

Additional Requirements

  • Python 3 & pip

Instructions

  1. Open your terminal
  2. Clone the repo: git clone git@github.com:Shumakriss/dataset-to-api.git
  3. Change directories to the 'processor' folder: cd dataset-to-api/processor
  4. Install dependencies: pip3 install --no-cache-dir -r requirements.txt
  5. Change directories to the 'web' folder: cd ../web
  6. Install dependencies: pip3 install --no-cache-dir -r requirements.txt
  7. Change to the parent directory: cd ..
  8. Launch the containers: docker-compose up --build

You may also run the processor.py and web.py scripts in their corresponding folders to debug outside of Docker.

Data Model

The data model in Postgres is based on the Data Dictionary.

marketplace - 2 letter country code of the marketplace where the review was written.
customer_id - Random identifier that can be used to aggregate reviews written by a single author.
review_id - The unique ID of the review.
product_id - The unique Product ID the review pertains to. In the multilingual dataset the reviews for the same product in different countries can be grouped by the same product_id.
product_parent - Random identifier that can be used to aggregate reviews for the same product.
product_title - Title of the product.
product_category - Broad product category that can be used to group reviews (also used to group the dataset into coherent parts).
star_rating - The 1-5 star rating of the review.
helpful_votes - Number of helpful votes.
total_votes - Number of total votes the review received.
vine - Review was written as part of the Vine program.
verified_purchase - The review is on a verified purchase.
review_headline - The title of the review.
review_body - The review text.
review_date - The date the review was written.
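A table definition matching this dictionary might look like the following sketch (the project's actual column names and types may differ):

# schema_sketch.py - one possible Postgres table for the review records
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS reviews (
    marketplace       CHAR(2),
    customer_id       TEXT,
    review_id         TEXT PRIMARY KEY,
    product_id        TEXT,
    product_parent    TEXT,
    product_title     TEXT,
    product_category  TEXT,
    star_rating       SMALLINT,
    helpful_votes     INTEGER,
    total_votes       INTEGER,
    vine              BOOLEAN,
    verified_purchase BOOLEAN,
    review_headline   TEXT,
    review_body       TEXT,
    review_date       DATE
);
"""

conn = psycopg2.connect(host="localhost", dbname="reviews",
                        user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    cur.execute(DDL)
conn.close()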

Working with the database

If you want to work directly with the database, you can run pgAdmin. Be careful not to expose your database or credentials in production. For development purposes, add a pgAdmin service to your docker-compose.yml like so:

pgadmin:  # service name is arbitrary; 'pgadmin' is conventional
  image: dpage/pgadmin4
  environment:
    - PGADMIN_DEFAULT_EMAIL=user@domain.com
    - PGADMIN_DEFAULT_PASSWORD=SuperSecret
  ports:
    - "80:80"

Possible Improvements & Features

  • Data quality (consistency checks, duplicates, format validation, etc.)
  • Integrate a more complete stack (API Gateway, user management, secrets management, etc.)
  • Supplement data (average star rating, rating histograms, top customers, etc.)
  • Further automated testing
  • Performance profiling & tuning
  • Additional product category datasets