A music streaming startup, Sparkify, has grown their user base and song database and wants to move their processes and data onto the cloud. Their data resides in S3: a directory of JSON logs of user activity on the app, and a directory of JSON metadata on the songs in their app.
In this project, I apply what I've learned about Spark and data lakes to build an ETL pipeline for a data lake hosted on S3. To complete the project, I load data from S3, process the data into analytics tables using Spark, and write those tables back to S3. I deploy this Spark process on an AWS EMR cluster.
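A minimal sketch of the Spark session setup such a pipeline might use (the `hadoop-aws` package version and the app name are assumptions, not values from the project files; on EMR the S3 connector is already on the classpath and the `config` line can be dropped):

```python
from pyspark.sql import SparkSession

def create_spark_session():
    """Create (or reuse) a SparkSession that can read from and write to S3."""
    spark = (
        SparkSession.builder
        # Assumed package/version; unnecessary when running on EMR.
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
        .appName("sparkify-data-lake")
        .getOrCreate()
    )
    return spark
```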
- Drop unnecessary tables and create staging and transform tables
- Load the data files from S3 (song data: s3://udacity-dend/song_data, log data: s3://udacity-dend/log_data) into staging tables
- Transform the staging tables into the final analytics tables
- Write the final tables back to S3 (see the sketch below)
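A rough sketch of these steps for the songs table, assuming the dataset's nested `song_data` directory layout and the column names listed in the schemas below; the output bucket in the usage comment is a placeholder:

```python
def process_song_data(spark, input_data, output_data):
    # Load the song JSON files from S3 into a staging DataFrame
    # (the glob pattern assumes the dataset's nested directory layout).
    song_df = spark.read.json(input_data + "song_data/*/*/*/*.json")

    # Transform the staging data into the songs dimension table.
    songs_table = song_df.select(
        "song_id", "title", "artist_id", "year", "duration"
    ).dropDuplicates(["song_id"])

    # Write the final table back to S3 as Parquet, partitioned by year and artist.
    songs_table.write.mode("overwrite") \
        .partitionBy("year", "artist_id") \
        .parquet(output_data + "songs/")

# Usage (input path from the project description; the output bucket is a placeholder):
# spark = create_spark_session()
# process_song_data(spark, "s3a://udacity-dend/", "s3a://my-sparkify-lake/")
```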
- Launch EMR Cluster and Notebook
- Step 1: Configure your cluster with the following settings (a scripted alternative is sketched after these steps)
- Step 2: Wait for Cluster "Waiting" Status
- Step 3: Import notebook from this repo
- Step 4: Configure your notebook
- Step 5: Wait for Notebook "Ready" Status, Then Open
- Step 6: Run the code in the notebook
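For reference, Step 1 could also be scripted with boto3 instead of the console. This is only a sketch: the release label, instance types, region, and key pair name are assumptions, not settings taken from this repo.

```python
import boto3

emr = boto3.client("emr", region_name="us-west-2")  # region is a placeholder

response = emr.run_job_flow(
    Name="sparkify-data-lake",
    ReleaseLabel="emr-5.30.0",                       # assumed release
    Applications=[{"Name": "Spark"}, {"Name": "Livy"}],  # Livy is typically required for EMR Notebooks
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2KeyName": "my-key-pair",          # placeholder
        "KeepJobFlowAliveWhenNoSteps": True,  # keeps the cluster in "Waiting" for the notebook
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```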
songplays
- songplay_id PRIMARY KEY
- start_time
- user_id FOREIGN KEY
- level
- song_id FOREIGN KEY
- artist_id FOREIGN KEY
- session_id
- location
- user_agent
users
- user_id PRIMARY KEY
- first_name
- last_name
- gender
- level
songs
- song_id PRIMARY KEY
- title
- artist_id
- year
- duration
artists
- artist_id PRIMARY KEY
- name
- location
- latitude
- longitude
time
- start_time PRIMARY KEY
- hour
- day
- week
- month
- year
- weekday
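A sketch of how the time and songplays tables above could be derived from the log data. The raw log field names (`ts`, `page`, `userId`, `sessionId`, `userAgent`, `song`, `artist`, `length`) are assumptions based on the event logs, not definitions from this README:

```python
from pyspark.sql import functions as F

def process_log_data(spark, input_data, output_data):
    # Load the log JSON files and keep only song-play events.
    log_df = spark.read.json(input_data + "log_data/*/*/*.json")
    log_df = log_df.filter(F.col("page") == "NextSong")

    # time table: derive datetime parts from the millisecond timestamp `ts`.
    log_df = log_df.withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
    time_table = log_df.select(
        "start_time",
        F.hour("start_time").alias("hour"),
        F.dayofmonth("start_time").alias("day"),
        F.weekofyear("start_time").alias("week"),
        F.month("start_time").alias("month"),
        F.year("start_time").alias("year"),
        F.dayofweek("start_time").alias("weekday"),
    ).dropDuplicates(["start_time"])
    time_table.write.mode("overwrite").partitionBy("year", "month") \
        .parquet(output_data + "time/")

    # songplays table: join log events with song metadata on title, artist, and duration.
    song_df = spark.read.json(input_data + "song_data/*/*/*/*.json")
    songplays_table = (
        log_df.join(
            song_df,
            (log_df.song == song_df.title)
            & (log_df.artist == song_df.artist_name)
            & (log_df.length == song_df.duration),
            "left",
        )
        .withColumn("songplay_id", F.monotonically_increasing_id())
        .select(
            "songplay_id", "start_time",
            F.col("userId").alias("user_id"), "level",
            "song_id", "artist_id",
            F.col("sessionId").alias("session_id"),
            "location",
            F.col("userAgent").alias("user_agent"),
        )
    )
    songplays_table.write.mode("overwrite").parquet(output_data + "songplays/")
```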