This project demonstrates how to build a data mart for a music streaming app and persist data to it via an ETL pipeline that extracts information from JSON logs and loads it into a PostgreSQL database. The JSON records are processed with the pandas library.
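As a rough illustration of the pandas step, the ETL reads newline-delimited JSON log records into a DataFrame and filters them before loading. A minimal sketch follows; the field names (`page`, `song`, etc.) are illustrative placeholders, not necessarily the project's actual log schema:

```python
import io

import pandas as pd

# Two newline-delimited JSON records standing in for the app's log files
# (field names here are hypothetical examples).
raw = (
    '{"ts": 1541105830796, "userId": "39", "level": "free", "page": "NextSong", "song": "A"}\n'
    '{"ts": 1541106106796, "userId": "8", "level": "free", "page": "Home", "song": null}\n'
)

# lines=True tells pandas to parse one JSON object per line.
df = pd.read_json(io.StringIO(raw), lines=True)

# Keep only song-play events before inserting rows into the database.
songplays = df[df["page"] == "NextSong"]
print(len(songplays))  # 1
```

In the project itself the same pattern is applied to the JSON files on disk (e.g. via `pd.read_json(path, lines=True)`).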
To test the project locally:
- Install Jupyter Lab
- Install PostgreSQL
- Install Python 3
- Install the Python pandas library
- Check out the code
- Navigate to the root of the project
- Run the command
  ```bash
  bash run_create_etl.sh
  ```
  which will create the database and populate it
- Start the notebook by running the command
  ```bash
  jupyter notebook
  ```
  which will launch it in the browser
- Open `test.ipynb` to run queries and view the data
├── README.md          # This file.
├── create_tables.py   # Python script with all methods necessary to recreate the data mart.
├── etl.ipynb          # Jupyter notebook describing and executing all tasks related to the extraction, transformation, and loading of the data.
├── run_create_etl.sh  # Bash shell script that creates the schema and persists the data into it.
├── sql_queries.py     # Python script defining the data mart schema and prepared/reusable queries.
└── test.ipynb         # Jupyter notebook for verifying that the database contains data.
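To give a feel for how `sql_queries.py` and `create_tables.py` divide the work, a common pattern is to keep the SQL as module-level strings that the scripts pass to a database cursor. The sketch below is a hypothetical example of that pattern; the table and column names are illustrative, not the project's actual schema:

```python
# Hypothetical sketch of the sql_queries.py pattern: DDL and prepared
# queries live as plain strings, with %s placeholders filled in later.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR
);
"""

songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level)
VALUES (%s, %s, %s);
"""

# create_tables.py / etl.ipynb would then execute these, e.g.:
# cur.execute(songplay_table_create)
# cur.execute(songplay_table_insert, (start_time, user_id, level))
print("%s" in songplay_table_insert)  # True
```

Keeping the SQL separate from the execution logic makes the schema easy to review and the queries reusable across the creation script and the ETL notebook.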
Do not hesitate to submit a pull request.