Sparkify is a startup company which is providing music streaming to their customers in sparkify music app. The Marketing Analytics team wants to understand their customers behaviour. Currently the team can not get the customers logs easily.
The log data and metadata are storing in JSON files. In this case data modeled in Postgres and ETL pipeline created with Python and dimension and fact tables created based on star schema.
- users: customers using sparkify app
- songs: Song related data. (Song Name, Artist, Album etc.)
- artist: Artist data (Artist Name, location etc.)
- time: time of customers logs. The timestamp divided into units (hour, week, time etc.)
- songplays: song plays logs
- Process Data in both song_data and log_data directories using Python and SQL
.
|-- create_tables.py ====> Create Dimension and Fact Tables with Postgre SQL
|-- data ================> Songs and Log Data in JSON Format
|-- etl.ipynb ===========> ETL Scripts Notebook for test
|-- etl.py ==============> ETL Scripts
|-- sql_queries.py ======> All CREATE, INSERT and SELECT queries
|-- test.ipynb ==========> Notebook for testing the ETL Scripts run correctly or not
os
glob
psycopg2
pandas
- install necessary libraries.
- run
create_tables.py
for creating database with SQL queries in .sql_queries.py
file. - run
etl.py
for process log and song data and transfer to the dimension and fact tables.