This project is part of the Udacity Nanodegree® on Data Engineering.
It demonstrates building a database from a JSON store: creating the appropriate database structures and populating them with correct data via an ETL pipeline.
🧘🏽 “Data that is loved tends to survive.” - Kurt Bollacker
Start by creating the database:

```
python create_tables.py
```

Then run the ETL process:

```
python etl.py
```
The data will then be loaded into the created tables and can be verified using the test.ipynb notebook in the repository.
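As a rough illustration of the kind of work the ETL step performs (the sample record and column names below are hypothetical, modelled on the description of the JSON store, not taken from the project's actual code):

```python
import json

# A hypothetical song record, mimicking one entry of the JSON store.
raw = (
    '{"song_id": "SOUPIRU1", "title": "Der Kleine Dompfaff", '
    '"artist_id": "ARJIE2Y1", "year": 0, "duration": 152.92036}'
)

record = json.loads(raw)

# Type conversions where necessary before loading into Postgres;
# the resulting tuple would be passed to an INSERT via cursor.execute().
song_row = (
    record["song_id"],
    record["title"],
    record["artist_id"],
    int(record["year"]),
    float(record["duration"]),
)
print(song_row)
```

In etl.py this loop runs over every JSON file found on disk, with the insert queries coming from sql_queries.py.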
Created using Creately. Sorry for the attribute stacking in the SONGPLAYS table; Creately only allows 60 elements in their free version!
The schema followed for this database is a star schema. `ARTIST_ID` is a duplicate column in `SONGS` and `ARTISTS`; this denormalisation makes for a more read-oriented structure, which is always preferred in analytics. Since the objective is to optimise the analytical workflow, this structure supports our goal.
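As a sketch of how such a star schema might be declared (the table and column names here are assumptions based on the description above; the project's actual DDL lives in sql_queries.py):

```python
# Hypothetical DDL strings in the style of sql_queries.py.
# Fact table: one row per play event, with ARTIST_ID duplicated
# from the dimension tables for read-oriented queries.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    song_id     VARCHAR,
    artist_id   VARCHAR
);
"""

# Dimension table: ARTIST_ID also appears here, denormalised on purpose.
artist_table_create = """
CREATE TABLE IF NOT EXISTS artists (
    artist_id VARCHAR PRIMARY KEY,
    name      VARCHAR NOT NULL,
    latitude  NUMERIC(8, 5),
    longitude NUMERIC(8, 5)
);
"""
```

`NUMERIC(8, 5)` reserves 5 digits after the decimal point, matching the latitude/longitude precision noted below.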
All tables have a `PRIMARY KEY`.
Column datatypes and sizes have been chosen to match the observed business data.
`LATITUDE` and `LONGITUDE` have been capped at 5 digits after the decimal point, which gives roughly metre-level accuracy. (Source)
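A quick back-of-the-envelope check of the resolution that 5 decimal places buys, assuming roughly 111.32 km per degree of latitude at the equator:

```python
# One degree of latitude spans roughly 111.32 km at the equator.
METERS_PER_DEGREE = 111_320

# Resolution of the 5th decimal place of a coordinate, in metres.
resolution_m = METERS_PER_DEGREE * 1e-5
print(round(resolution_m, 2))  # ~1.11 m
```

So 5 decimal places pins a point down to about a metre, which is more than enough for artist locations.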
- `create_tables.py`: resets/creates the database and the tables within.
- `sql_queries.py`: queries supporting dropping, creating, and 'upsertion' of tables and data.
- `etl.py`: reads the JSON files, performs type conversions where necessary, and pumps the data into Postgres.
- `etl.ipynb`: interactive version of `etl.py`.
- `test.ipynb`: notebook to check that the data was loaded correctly.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.