Build-a-Data-Lake-Using-AWS-S3-Spark-Cluster

In this project, we'll build an ETL pipeline for a data lake. The source data resides in S3: a directory of JSON logs of user activity on the app, and a directory of JSON metadata on the songs in the app. We will load the data from S3, process it into analytics tables using Spark, and write those tables back to S3, then deploy this Spark process on a cluster using AWS.


Data Lakes using AWS S3 and Apache Spark

Introduction

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. In this project, we build an ETL pipeline that extracts their data from AWS S3, processes it with Apache Spark, and transforms it into a fact table and a set of dimension tables stored as parquet files for analytics purposes.

Source

Our source data resides in S3, in a directory of JSON logs on user activity on the app named log_data, as well as a directory with JSON metadata on the songs in their app named song_data.

Song data : s3a://udacity-dend/song-data/

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here is what a single song JSON file looks like.

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log Data : s3a://udacity-dend/log-data/

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate app activity logs from an imaginary music streaming app based on configuration settings.

Destination

The destination is an AWS S3 bucket holding parquet files. Each of the five tables is written to parquet files in a separate analytics directory on S3, and each table has its own folder within that directory. Songs table files are partitioned by year and then artist. Time table files are partitioned by year and month. Songplays table files are partitioned by year and month.
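As a rough illustration, the partitioned writes might look like the sketch below. The bucket name, output prefix and DataFrame names (songs_table, time_table, songplays_table) are placeholders, not the exact code in etl.py.

# Illustrative only: songs_table, time_table and songplays_table are the Spark
# DataFrames built by the ETL; the bucket/prefix is whatever you created for the lake.
output_data = "s3a://udacitydends/analytics/"

songs_table.write.mode("overwrite").partitionBy("year", "artist_id").parquet(output_data + "songs/")
time_table.write.mode("overwrite").partitionBy("year", "month").parquet(output_data + "time/")
songplays_table.write.mode("overwrite").partitionBy("year", "month").parquet(output_data + "songplays/")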

Fact Table

  1. songplays - records in event data associated with song plays i.e. records with page NextSong
  • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

  1. users - users in the app
  • user_id, first_name, last_name, gender, level
  2. songs - songs in music database
  • song_id, title, artist_id, year, duration
  3. artists - artists in music database
  • artist_id, name, location, latitude, longitude
  4. time - timestamps of records in songplays broken down into specific units
  • start_time, hour, day, week, month, year, weekday

ETL Approach

Create an IAM user and grant appropriate permissions for reading from and writing to the Amazon S3 resource. Create a bucket on S3 called udacitydends (or any name of your choice, but bucket names must be unique, so make changes accordingly). This bucket will store the processed parquet files of the data lake.
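A minimal sketch of how etl.py could pick up the credentials from dl.cfg and build a Spark session with s3a support. The [AWS] section name, the key names, and the hadoop-aws version are assumptions and must match your own setup.

import configparser
import os
from pyspark.sql import SparkSession

# Assumed dl.cfg layout: an [AWS] section holding AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
config = configparser.ConfigParser()
config.read("dl.cfg")
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

# The hadoop-aws package provides the s3a:// filesystem; use the version matching your Hadoop build.
spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4") \
    .getOrCreate()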

Song Data

  1. Read the song data from the s3a://udacity-dend/song-data/ S3 bucket.
  2. Using Apache Spark, extract the songs and artists data.
  3. Partition the songs table files by year and then artist.
  4. While writing the data into artists and songs, remove duplicate records and sort the songs by year (see the sketch after this list).
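A sketch of these steps, assuming the song JSON layout shown earlier; the read path pattern and the output_data prefix are illustrative.

from pyspark.sql.functions import col

# Read the raw song JSON files; the path wildcard mirrors the track-ID partitioning.
song_df = spark.read.json("s3a://udacity-dend/song-data/*/*/*/*.json")

# songs table: one row per song_id, sorted by year, partitioned by year and artist on write.
songs_table = (song_df
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
    .orderBy("year"))
songs_table.write.mode("overwrite").partitionBy("year", "artist_id").parquet(output_data + "songs/")

# artists table: one row per artist_id.
artists_table = (song_df
    .select(col("artist_id"),
            col("artist_name").alias("name"),
            col("artist_location").alias("location"),
            col("artist_latitude").alias("latitude"),
            col("artist_longitude").alias("longitude"))
    .dropDuplicates(["artist_id"]))
artists_table.write.mode("overwrite").parquet(output_data + "artists/")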

Log Data

  1. Read the log data from the s3a://udacity-dend/log-data/ S3 bucket.
  2. Using Apache Spark, extract the data into the users table by ranking each user's records in descending order of ts and keeping only rank 1 (the most recent record).
  3. Using Apache Spark, extract the data into the time table: convert the existing ts column into a TIMESTAMP and extract datetime attributes such as hour, day, week, month, year and weekday. Time table files are partitioned by year and month.
  4. Using Apache Spark, extract the data into the songplays table by keeping only the rows with page == NextSong while performing a LEFT JOIN on the songs table. Songplays table files are partitioned by year and month (see the sketch after this list).
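A sketch of these steps, assuming the column names found in the log JSON (userId, firstName, lastName, ts, sessionId, userAgent and so on) and that song_df is the song DataFrame from the previous step.

from pyspark.sql import Window
from pyspark.sql.functions import (col, row_number, from_unixtime, hour, dayofmonth,
                                   weekofyear, month, year, date_format,
                                   monotonically_increasing_id)

# Read the log JSON files and keep only song-play events.
log_df = spark.read.json("s3a://udacity-dend/log-data/*/*/*.json")
log_df = log_df.filter(col("page") == "NextSong")

# users table: rank each user's events by ts descending and keep rank 1 (the latest record).
user_window = Window.partitionBy("userId").orderBy(col("ts").desc())
users_table = (log_df
    .withColumn("rank", row_number().over(user_window))
    .filter(col("rank") == 1)
    .select(col("userId").alias("user_id"),
            col("firstName").alias("first_name"),
            col("lastName").alias("last_name"),
            "gender", "level"))

# time table: ts is in epoch milliseconds; convert it to a TIMESTAMP and derive the parts.
log_df = log_df.withColumn("start_time", from_unixtime(col("ts") / 1000).cast("timestamp"))
time_table = (log_df.select("start_time").dropDuplicates()
    .withColumn("hour", hour("start_time"))
    .withColumn("day", dayofmonth("start_time"))
    .withColumn("week", weekofyear("start_time"))
    .withColumn("month", month("start_time"))
    .withColumn("year", year("start_time"))
    .withColumn("weekday", date_format("start_time", "E")))

# songplays table: LEFT JOIN the events onto the songs data to pick up song_id and artist_id.
songplays_table = (log_df
    .join(song_df,
          (log_df.song == song_df.title) &
          (log_df.artist == song_df.artist_name) &
          (log_df.length == song_df.duration),
          "left")
    .withColumn("songplay_id", monotonically_increasing_id())
    .select("songplay_id", "start_time",
            col("userId").alias("user_id"), "level",
            "song_id", "artist_id",
            col("sessionId").alias("session_id"),
            "location",
            col("userAgent").alias("user_agent"))
    .withColumn("year", year("start_time"))
    .withColumn("month", month("start_time")))

The tables are then written out as partitioned parquet files, as shown in the Destination section above.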

Run Instructions

  1. Make sure your AWS S3 bucket is active and that the data lake credentials in dl.cfg are accurate.
  2. From a shell, run python etl.py.
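For reference, one possible dl.cfg layout matching the session sketch above; the section and key names are assumptions, and the file should be kept out of version control.

[AWS]
AWS_ACCESS_KEY_ID=<your access key>
AWS_SECRET_ACCESS_KEY=<your secret key>

With the credentials in place, running python etl.py reads the source data from S3, builds the tables, and writes the parquet output to your bucket.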

Note

Do not forget to clean up your AWS resources.