A music streaming startup, Sparkify, has grown their user base and song database and wants to move their processes and data onto the cloud. Their data resides in S3: a directory of JSON logs of user activity on the app, and a directory of JSON metadata on the songs in their app.
In this project, I apply what I've learned about Spark and data lakes to build an ETL pipeline for a data lake hosted on S3. To complete the project, I load data from S3, process the data into analytics tables using Spark, and write those tables back to S3. I deploy this Spark process on an AWS EMR cluster.
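A minimal sketch of the Spark session setup such a pipeline might use (the `hadoop-aws` package version and the app name are assumptions, not values from the project files; on EMR the S3 connector is already on the classpath and the `config` line can be dropped):

```python
from pyspark.sql import SparkSession

def create_spark_session():
    """Create (or reuse) a SparkSession that can read from and write to S3."""
    spark = (
        SparkSession.builder
        # Assumed package/version; unnecessary when running on EMR.
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
        .appName("sparkify-data-lake")
        .getOrCreate()
    )
    return spark
```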
- Drop unnecessary tables and create staging and transform tables
- Load the data files from S3 (song data: s3://udacity-dend/song_data, log data: s3://udacity-dend/log_data) into staging tables
- Transform the staging tables into the final analytics tables
- Write the final tables back to S3 (see the sketch below)
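A rough sketch of these steps for the songs table, assuming the dataset's nested `song_data` directory layout and the column names listed in the schemas below; the output bucket in the usage comment is a placeholder:

```python
def process_song_data(spark, input_data, output_data):
    # Load the song JSON files from S3 into a staging DataFrame
    # (the glob pattern assumes the dataset's nested directory layout).
    song_df = spark.read.json(input_data + "song_data/*/*/*/*.json")

    # Transform the staging data into the songs dimension table.
    songs_table = song_df.select(
        "song_id", "title", "artist_id", "year", "duration"
    ).dropDuplicates(["song_id"])

    # Write the final table back to S3 as Parquet, partitioned by year and artist.
    songs_table.write.mode("overwrite") \
        .partitionBy("year", "artist_id") \
        .parquet(output_data + "songs/")

# Usage (input path from the project description; the output bucket is a placeholder):
# spark = create_spark_session()
# process_song_data(spark, "s3a://udacity-dend/", "s3a://my-sparkify-lake/")
```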
- Launch EMR Cluster and Notebook
- Step 1: Configure your cluster with the following settings (a scripted alternative is sketched after these steps)
- Step 2: Wait for Cluster "Waiting" Status
- Step 3: Import notebook from this repo
- Step 4: Configure your notebook
- Step 5: Wait for Notebook "Ready" Status, Then Open
- Step 6: Run the code in the notebook
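For reference, Step 1 could also be scripted with boto3 instead of the console. This is only a sketch: the release label, instance types, region, and key pair name are assumptions, not settings taken from this repo.

```python
import boto3

emr = boto3.client("emr", region_name="us-west-2")  # region is a placeholder

response = emr.run_job_flow(
    Name="sparkify-data-lake",
    ReleaseLabel="emr-5.30.0",                       # assumed release
    Applications=[{"Name": "Spark"}, {"Name": "Livy"}],  # Livy is typically required for EMR Notebooks
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2KeyName": "my-key-pair",          # placeholder
        "KeepJobFlowAliveWhenNoSteps": True,  # keeps the cluster in "Waiting" for the notebook
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```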
songplays
- songplay_id PRIMARY KEY
- start_time
- user_id FOREIGN KEY
- level
- song_id FOREIGN KEY
- artist_id FOREIGN KEY
- session_id
- location
- user_agent
users
- user_id PRIMARY KEY
- first_name
- last_name
- gender
- level
songs
- song_id PRIMARY KEY
- title
- artist_id
- year
- duration
artists
- artist_id PRIMARY KEY
- name
- location
- latitude
- longitude
time
- start_time PRIMARY KEY
- hour
- day
- week
- month
- year
- weekday
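A sketch of how the time and songplays tables above could be derived from the log data. The raw log field names (`ts`, `page`, `userId`, `sessionId`, `userAgent`, `song`, `artist`, `length`) are assumptions based on the event logs, not definitions from this README:

```python
from pyspark.sql import functions as F

def process_log_data(spark, input_data, output_data):
    # Load the log JSON files and keep only song-play events.
    log_df = spark.read.json(input_data + "log_data/*/*/*.json")
    log_df = log_df.filter(F.col("page") == "NextSong")

    # time table: derive datetime parts from the millisecond timestamp `ts`.
    log_df = log_df.withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
    time_table = log_df.select(
        "start_time",
        F.hour("start_time").alias("hour"),
        F.dayofmonth("start_time").alias("day"),
        F.weekofyear("start_time").alias("week"),
        F.month("start_time").alias("month"),
        F.year("start_time").alias("year"),
        F.dayofweek("start_time").alias("weekday"),
    ).dropDuplicates(["start_time"])
    time_table.write.mode("overwrite").partitionBy("year", "month") \
        .parquet(output_data + "time/")

    # songplays table: join log events with song metadata on title, artist, and duration.
    song_df = spark.read.json(input_data + "song_data/*/*/*/*.json")
    songplays_table = (
        log_df.join(
            song_df,
            (log_df.song == song_df.title)
            & (log_df.artist == song_df.artist_name)
            & (log_df.length == song_df.duration),
            "left",
        )
        .withColumn("songplay_id", F.monotonically_increasing_id())
        .select(
            "songplay_id", "start_time",
            F.col("userId").alias("user_id"), "level",
            "song_id", "artist_id",
            F.col("sessionId").alias("session_id"),
            "location",
            F.col("userAgent").alias("user_agent"),
        )
    )
    songplays_table.write.mode("overwrite").parquet(output_data + "songplays/")
```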