Amazon Product Data Pipeline

This is a demo data pipeline that imports sample Amazon product data into a Postgres database.

The solution consists of five parts:

  • Airflow to orchestrate the process and start jobs (a DAG sketch follows the list)
  • Spark to process downloaded data and aggregate it for data marts
  • Postgres database to accommodate the data
  • Flask web API (simple application) to display transformation/aggregation results
  • Jupyter notebook to explore Spark transformations (uncomment in docker-compose file)

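For illustration, the orchestration could be wired roughly as below. This is a minimal sketch, not the project's actual DAG: the task ids, script paths, and schedule are hypothetical; only the dag_id matches the pipeline described here.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Minimal sketch of the orchestration (Airflow 1.x style, matching the
# trigger_dag CLI used below). Task ids and script paths are hypothetical.
with DAG(
    dag_id="amazon-product-data-pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered manually from the UI or CLI
) as dag:
    download = BashOperator(
        task_id="download_data",
        bash_command="python /opt/pipeline/download.py",  # hypothetical path
    )
    aggregate = BashOperator(
        task_id="spark_aggregate",
        bash_command="spark-submit /opt/pipeline/aggregate.py",  # hypothetical path
    )
    download >> aggregate  # Spark aggregation runs after the download completes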

Prerequisites

  • Docker and Docker Compose
  • psql client (optional, to inspect the results in the database)

Usage

Start the needed services with docker-compose (the project contains a Spark installation, so please allow Docker at least 6 GiB of memory):

$ docker-compose -f docker/docker-compose.yml up -d

To execute the import, use the Airflow web UI to trigger the amazon-product-data-pipeline DAG:

http://localhost:8282/


or use the command:

$ docker-compose -f docker/docker-compose.yml run airflow-webserver \
    airflow trigger_dag amazon-product-data-pipeline 
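
The DAG can also be triggered programmatically. A minimal sketch, assuming the Airflow 1.10 experimental REST API is enabled with the default (open) auth backend; verify this against the project's Airflow version and config:

import json
import urllib.request

# Trigger the DAG via Airflow's experimental REST API (an assumption:
# available and open by default in Airflow 1.10.x; verify for this setup).
url = "http://localhost:8282/api/experimental/dags/amazon-product-data-pipeline/dag_runs"
request = urllib.request.Request(
    url,
    data=json.dumps({"conf": {}}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))  # should confirm the new dag run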

Web API

To check the import results, use the simple web API, which shows the top 5, bottom 5, and top 5 most improved (by rating) movies for a given month:

http://localhost:5000/

Note that URL parameters can be provided:

http://localhost:5000/?month=2013-05&items_per_list=5
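
The same endpoint can be queried from a script. A small sketch using only the standard library; the response format is an assumption, so the code falls back to printing the raw body if it is not JSON:

import json
import urllib.request

# Query the web API for May 2013, five items per list.
url = "http://localhost:5000/?month=2013-05&items_per_list=5"
with urllib.request.urlopen(url) as response:
    body = response.read().decode("utf-8")

# The response format is an assumption: pretty-print JSON if possible,
# otherwise fall back to the raw body (the page may be HTML).
try:
    print(json.dumps(json.loads(body), indent=2))
except json.JSONDecodeError:
    print(body)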

Database

To connect to the database and check the results, use

$ psql -U airflow -h localhost -p 5432 -d test

then enter the password airflow.
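
The same check can be scripted. A minimal sketch, assuming the psycopg2 package is installed; the credentials match the psql command above, and the query simply lists whatever tables the import created:

import psycopg2

# Connect with the same credentials as the psql command above.
connection = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="test",
    user="airflow",
    password="airflow",
)
with connection, connection.cursor() as cursor:
    # List the tables created by the import; names vary by project.
    cursor.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public' ORDER BY table_name"
    )
    for (table_name,) in cursor.fetchall():
        print(table_name)
connection.close()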

Logging

Solution logs can be accessed in the logs folder.

Useful URLs

Airflow: http://localhost:8282/

Spark Master: http://localhost:8181/

Jupyter Notebook: http://127.0.0.1:8888/

  • For the Jupyter notebook, copy the URL with the token that is generated when the container starts and paste it into your browser. The URL with the token can be taken from the container logs using:
$ docker logs -f docker_jupyter-spark_1
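
Inside the notebook, a session attached to the dockerized Spark cluster could be created roughly as follows. The master URL is an assumption; check docker/docker-compose.yml for the actual service name and port:

from pyspark.sql import SparkSession

# Attach to the dockerized Spark cluster. The master URL is an assumption;
# check docker/docker-compose.yml for the actual service name and port.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("exploration")
    .getOrCreate()
)
print(spark.version)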