Data Warehouse with AWS

Overview

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. The data resides in S3: a directory of JSON logs of user activity on the app, and a directory of JSON metadata on the songs in the app. The objective of this project is to build an ETL pipeline for a data warehouse: we extract data from S3, stage it in Redshift, and transform it into a set of dimensional tables that let the analytics team continue finding insights into what songs their users are listening to.

Datasets

There are two datasets that reside in S3 buckets.

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.

Sample Song Record:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations. The log files in the dataset are partitioned by year and month.
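For example, under this year/month partitioning a single log file path might look like the following (the bucket name udacity-dend is an assumption; substitute whichever bucket holds your copy of the data):

s3://udacity-dend/log_data/2018/11/2018-11-12-events.json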

Sample Log Record:

{"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"","userId":"39"}

Schema

Fact Table

  • fact_songplay: Records from the event log data associated with song plays.

Dimension Tables

  • dim_user: Users of Sparkify app. Columns are user_id, first_name, last_name, gender, level.

  • dim_song: Songs in the dataset. Columns are song_id, title, artist_id, year, duration.

  • dim_artist: Artists in the dataset. Columns are artist_id, name, location, latitude, longitude.

  • dim_time: Timestamps of records in fact_songplay broken down into specific units. Columns are start_time, hour, day, week, month, year, weekday. (Example DDL for dim_user and dim_time is sketched below.)
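As an illustration of how these definitions translate into Redshift DDL, here is a minimal sketch for two of the dimension tables, written as the SQL strings a Python script would execute. The data types and constraints are assumptions based on the column lists above; adapt them to the project's actual table definitions.

# Minimal DDL sketch; column types and keys are assumptions.
dim_user_create = """
CREATE TABLE IF NOT EXISTS dim_user (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     CHAR(1),
    level      VARCHAR
);
"""

dim_time_create = """
CREATE TABLE IF NOT EXISTS dim_time (
    start_time TIMESTAMP PRIMARY KEY,
    hour       INT,
    day        INT,
    week       INT,
    month      INT,
    year       INT,
    weekday    INT
);
"""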

How To Run

You need an AWS account to run the project, and you should set up the configuration file with your AWS credentials and cluster settings before running the scripts below.
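The scripts would typically read those settings with the standard library's configparser. A minimal sketch, assuming a file named dwh.cfg with CLUSTER and IAM_ROLE sections (the file name and every key below are assumptions; match them to your actual configuration file):

import configparser

config = configparser.ConfigParser()
config.read("dwh.cfg")  # hypothetical file name

HOST        = config.get("CLUSTER", "HOST")          # Redshift endpoint
DB_NAME     = config.get("CLUSTER", "DB_NAME")
DB_USER     = config.get("CLUSTER", "DB_USER")
DB_PASSWORD = config.get("CLUSTER", "DB_PASSWORD")
DB_PORT     = config.get("CLUSTER", "DB_PORT")
ARN         = config.get("IAM_ROLE", "ARN")          # role Redshift assumes to read S3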

Create the Redshift cluster by running create_cluster_iac.py.

$ python create_cluster_iac.py
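Infrastructure-as-code scripts like this one typically create the cluster through boto3. A minimal sketch of what create_cluster_iac.py might do (the region, identifiers, sizes, and role ARN are all placeholder assumptions; the real script would read them from the configuration file):

import boto3

redshift = boto3.client("redshift", region_name="us-west-2")  # assumed region

redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    ClusterIdentifier="sparkify-cluster",  # placeholder
    DBName="sparkify",                     # placeholder
    MasterUsername="awsuser",              # placeholder
    MasterUserPassword="<from-config>",    # never hard-code real credentials
    IamRoles=["arn:aws:iam::123456789012:role/redshift-s3-read"],  # placeholder ARN
)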

Run create_tables.py to create the staging and analytics tables.

$ python create_tables.py
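Such a script usually connects with psycopg2 and replays a list of DROP/CREATE statements. A minimal sketch, assuming the config keys shown earlier (the empty query lists stand in for the project's actual statements):

import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")  # hypothetical file name

conn = psycopg2.connect(
    host=config.get("CLUSTER", "HOST"),
    dbname=config.get("CLUSTER", "DB_NAME"),
    user=config.get("CLUSTER", "DB_USER"),
    password=config.get("CLUSTER", "DB_PASSWORD"),
    port=config.get("CLUSTER", "DB_PORT"),
)
cur = conn.cursor()

# Placeholders; the real script would hold the project's DROP and CREATE statements.
drop_table_queries = []
create_table_queries = []

for query in drop_table_queries + create_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()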

Run etl.py to load data from S3 into the staging tables and from the staging tables into the analytics tables on Redshift.

$ python etl.py
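The two loading phases map onto two SQL patterns: a Redshift COPY from S3 into a staging table, then an INSERT ... SELECT from staging into the analytics tables. A sketch of each (the table and column names, S3 paths, and role ARN are assumptions):

# Stage raw JSON events from S3; paths and ARN are placeholders.
staging_events_copy = """
COPY staging_events
FROM 's3://udacity-dend/log_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
FORMAT AS JSON 's3://udacity-dend/log_json_path.json'
REGION 'us-west-2';
"""

# Transform staged rows into a dimension table, deduplicating on the key.
dim_user_insert = """
INSERT INTO dim_user (user_id, first_name, last_name, gender, level)
SELECT DISTINCT userId, firstName, lastName, gender, level
FROM staging_events
WHERE userId IS NOT NULL;
"""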

Now, you can run analytic queries on your Redshift database.
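For example, a query like this would surface the most played songs (a hypothetical illustration; in particular, it assumes fact_songplay carries song_id and artist_id foreign keys):

most_played = """
SELECT s.title, a.name AS artist, COUNT(*) AS plays
FROM fact_songplay f
JOIN dim_song   s ON f.song_id = s.song_id
JOIN dim_artist a ON f.artist_id = a.artist_id
GROUP BY s.title, a.name
ORDER BY plays DESC
LIMIT 10;
"""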

Delete the Redshift cluster by running delete_cluster_iac.py.

$ python delete_cluster_iac.py
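Teardown is the mirror image of creation. A minimal boto3 sketch of what delete_cluster_iac.py might do (the identifier and region are the same placeholder assumptions as above):

import boto3

redshift = boto3.client("redshift", region_name="us-west-2")  # assumed region

# Skipping the final snapshot avoids storage costs; set this to False to keep one.
redshift.delete_cluster(
    ClusterIdentifier="sparkify-cluster",  # must match the cluster created earlier
    SkipFinalClusterSnapshot=True,
)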