This is a demo of a data pipeline that imports sample Amazon product data into a Postgres database.
The solution consists of the following parts:
- Airflow to orchestrate the process and start jobs
- Spark to process downloaded data and aggregate it for data marts
- Postgres DB to store the data
- Flask web API (a simple application) to display transformation/aggregation results
- Jupyter notebook to explore Spark transformations (uncomment it in the docker-compose file)
- Install Docker
- Install Docker Compose
Start the needed services with docker-compose (the project includes a Spark installation, so allow Docker at least 6 GiB of memory):
$ docker-compose -f docker/docker-compose.yml up -d
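The referenced docker/docker-compose.yml wires these services together. A rough structural sketch of what it might contain is below; the service names (other than airflow-webserver and jupyter-spark, which appear elsewhere in this README) and the internal ports are assumptions, and the image/build entries are omitted:

```yaml
# Sketch only - service names and internal ports are assumptions;
# published ports are the ones this README refers to.
version: "3"
services:
  postgres:
    ports:
      - "5432:5432"    # psql -h localhost -p 5432
  airflow-webserver:
    ports:
      - "8282:8080"    # Airflow UI at http://localhost:8282/
  spark-master:
    ports:
      - "8181:8080"    # Spark master UI at http://localhost:8181/
  web-api:
    ports:
      - "5000:5000"    # Flask API at http://localhost:5000/
  # jupyter-spark:     # uncomment to enable the notebook
  #   ports:
  #     - "8888:8888"  # http://127.0.0.1:8888/
```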
To execute the import, trigger the amazon-product-data-pipeline DAG in the Airflow web UI, or use the command:
$ docker-compose -f docker/docker-compose.yml run airflow-webserver \
airflow trigger_dag amazon-product-data-pipeline
To check the import results, use the simple web API, which shows the top 5, bottom 5, and top 5 most improved (rating-wise) movies for a given month.
Note that the month and list size can be provided as URL parameters:
http://localhost:5000/?month=2013-05&items_per_list=5
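The kind of aggregation behind those three lists can be sketched in plain Python. This is an illustration only, not the pipeline's actual code: it assumes the data marts hold one average rating per (month, product), and "most improved" compares against the previous month.

```python
def build_lists(ratings, month, items_per_list=5):
    """ratings: {month: {product: avg_rating}} -> (top, bottom, most_improved).

    Illustrative sketch of the API's three lists; names are assumptions.
    """
    current = ratings[month]
    ranked = sorted(current, key=current.get, reverse=True)
    top = ranked[:items_per_list]
    bottom = ranked[-items_per_list:][::-1]  # worst rating first

    # "Most improved" = largest rating gain vs the previous month,
    # computed only for products present in both months.
    year, mon = map(int, month.split("-"))
    prev_month = f"{year - (mon == 1):04d}-{12 if mon == 1 else mon - 1:02d}"
    previous = ratings.get(prev_month, {})
    deltas = {p: current[p] - previous[p] for p in current if p in previous}
    improved = sorted(deltas, key=deltas.get, reverse=True)[:items_per_list]
    return top, bottom, improved
```

With items_per_list=5 and month=2013-05 this mirrors the example URL above.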
To connect to the database and check the results, use
$ psql -U airflow -h localhost -p 5432 -d test
then enter the password airflow
Solution logs can be found in the logs folder
Airflow: http://localhost:8282/
Spark Master: http://localhost:8181/
Jupyter Notebook: http://127.0.0.1:8888/
- For the Jupyter notebook, copy the URL with the token generated when the container starts and paste it into your browser. The URL with the token can be taken from the container logs using:
$ docker logs -f docker_jupyter-spark_1
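If you prefer to pull the token URL out of the captured logs programmatically, a small regex does the job. This is a sketch; the sample log text below is illustrative, not an actual excerpt from the container:

```python
import re

# Illustrative sample of what the Jupyter container prints on startup.
log_text = """
[I 10:00:00.000 NotebookApp] The Jupyter Notebook is running at:
[I 10:00:00.000 NotebookApp] http://127.0.0.1:8888/?token=abc123def456
"""

# Extract the first URL carrying a login token.
match = re.search(r"http://127\.0\.0\.1:8888/\?token=(\w+)", log_text)
token_url = match.group(0) if match else None
```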