Udacity Data Warehouse project

Description

In this project, I worked as a data engineer developing a data warehouse for an imaginary startup, Sparkify. According to the project premise, the startup wants to move its data and processes to AWS. My task was to build an ETL pipeline that extracts data from S3 and loads it into a Redshift cluster, where I organised the raw data into data models following a star schema design.

Datasets

I worked with two datasets that reside in S3:

  1. Song dataset. The dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.

  2. Log dataset. The dataset consists of log files in JSON format generated by an event simulator, based on the songs in the dataset above.
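
For orientation, the records in the two datasets look roughly like the following. The field names follow the standard Udacity/Million Song Dataset layout and the values are purely illustrative, so they may differ from the actual files.

```python
# Illustrative song record (one JSON object per file),
# e.g. song_data/A/B/C/TRABCEI128F424C983.json
song_record = {
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": None,
    "artist_longitude": None,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "year": 0,
    "duration": 152.92036,
}

# Illustrative log record (one JSON object per line),
# e.g. log_data/2018/11/2018-11-12-events.json
log_record = {
    "artist": "Pavement",
    "auth": "Logged In",
    "firstName": "Sylvie",
    "gender": "F",
    "itemInSession": 0,
    "lastName": "Cruz",
    "length": 99.16036,
    "level": "free",
    "location": "Washington-Arlington-Alexandria, DC-VA-MD-WV",
    "method": "PUT",
    "page": "NextSong",
    "registration": 1540266185796.0,
    "sessionId": 345,
    "song": "Mercy: The Laundromat",
    "status": 200,
    "ts": 1541990258796,
    "userAgent": "Mozilla/5.0 ...",
    "userId": "10",
}
```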

Files

  • dwh.cfg. The configuration file that holds the AWS credentials and all of the Redshift cluster and database settings used by the other scripts

  • create_delete_cluster.py. The script has two main functions: creating and deleting a Redshift cluster. When a user runs the script, they are first asked whether they want to create or delete the cluster. If the user chooses 'create', the script also saves the newly created IAM role and the cluster endpoint address to dwh.cfg (the underlying boto3 calls are sketched after this file list)

  • sql_queries.py. The Python file that contains all SQL queries needed for the drop, create, and insert operations

  • create_tables.py. The script connects to the Redshift cluster and creates the staging tables used for copying data from S3 as well as the tables of the star schema

  • etl.py. The script connects to the Redshift cluster, copies data from S3 into the staging tables, and then inserts values into the tables created by the previous script
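
As a rough illustration of what create_delete_cluster.py does, the core boto3 calls might look like the sketch below. The cluster identifier, node type, and credential placeholders are assumptions made for this example; the real script reads these values from dwh.cfg and writes the resulting IAM role and endpoint back into it.

```python
import boto3

# Credentials and region would normally come from dwh.cfg; placeholders are shown inline here.
redshift = boto3.client(
    "redshift",
    region_name="us-west-2",
    aws_access_key_id="<KEY>",
    aws_secret_access_key="<SECRET>",
)


def create_cluster():
    # Launch a small multi-node cluster; the values are placeholders, not the project's settings.
    redshift.create_cluster(
        ClusterType="multi-node",
        NodeType="dc2.large",
        NumberOfNodes=4,
        DBName="dwh",
        ClusterIdentifier="dwhCluster",
        MasterUsername="dwhuser",
        MasterUserPassword="<PASSWORD>",
        IamRoles=["<IAM_ROLE_ARN>"],  # role that allows Redshift to read from S3
    )


def get_endpoint_address():
    # Once the cluster is available, its endpoint address can be read back
    # (the real script stores this, together with the IAM role ARN, in dwh.cfg).
    props = redshift.describe_clusters(ClusterIdentifier="dwhCluster")["Clusters"][0]
    return props["Endpoint"]["Address"]


def delete_cluster():
    # Tear the cluster down without keeping a final snapshot.
    redshift.delete_cluster(
        ClusterIdentifier="dwhCluster",
        SkipFinalClusterSnapshot=True,
    )
```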

How to run the project

  1. Add your personal AWS KEY and SECRET and fill in all of the information needed for the future cluster in dwh.cfg
  2. Run create_delete_cluster.py and choose the option 'create'
  3. Run create_tables.py to create tables in a cluster
  4. Run etl.py to copy data from S3 into the staging tables and insert values into the dimensional models (a minimal sketch of this flow follows this list)
  5. Run create_delete_cluster.py and choose the option 'delete' to delete a cluster
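
A minimal sketch of the copy-and-insert flow behind step 4 is shown below. It assumes that sql_queries.py exposes two lists named copy_table_queries and insert_table_queries and that the connection settings live in a CLUSTER section of dwh.cfg; the exact names in this repository may differ.

```python
import configparser

import psycopg2

# copy_table_queries and insert_table_queries are assumed to be defined in sql_queries.py.
from sql_queries import copy_table_queries, insert_table_queries


def run_queries(cur, conn, queries):
    # Execute each COPY (S3 -> staging) or INSERT ... SELECT (staging -> star schema) statement.
    for query in queries:
        cur.execute(query)
        conn.commit()


def main():
    # Read the cluster endpoint and database credentials from dwh.cfg
    # (section and option names are assumptions for this sketch).
    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    conn = psycopg2.connect(
        host=config.get("CLUSTER", "HOST"),
        dbname=config.get("CLUSTER", "DB_NAME"),
        user=config.get("CLUSTER", "DB_USER"),
        password=config.get("CLUSTER", "DB_PASSWORD"),
        port=config.get("CLUSTER", "DB_PORT"),
    )
    cur = conn.cursor()

    run_queries(cur, conn, copy_table_queries)    # load staging tables from S3
    run_queries(cur, conn, insert_table_queries)  # populate the star-schema tables

    conn.close()


if __name__ == "__main__":
    main()
```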

Final tables

Fact Table

songplays - records in the event data associated with song plays, i.e. records with page NextSong (songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)

Dimension Tables

users - users in the app (user_id, first_name, last_name, gender, level)

songs - songs in the music database (song_id, title, artist_id, year, duration)

artists - artists in the music database (artist_id, name, location, latitude, longitude)

time - timestamps of records in songplays broken down into specific units (start_time, hour, day, week, month, year, weekday)
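
In the style of sql_queries.py, the column lists above translate into Redshift DDL roughly as in the sketch below. The data types, identity column, and constraints are assumptions for illustration, not necessarily what the project actually uses.

```python
# Hypothetical excerpt in the style of sql_queries.py; types and constraints are illustrative.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""

user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    );
"""
```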