A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. In this project, we build an ETL pipeline that extracts their data from AWS S3, processes it with Apache Spark, and transforms it into a fact table and a set of dimension tables, stored as parquet files, for analytics purposes.

Our source data resides in S3: a directory of JSON logs on user activity on the app, named `log_data`, and a directory of JSON metadata on the songs in their app, named `song_data`.
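Throughout, the processing is assumed to run on a SparkSession that can read from S3 through the s3a connector. A minimal sketch follows; the `hadoop-aws` version is only an example and must match your Spark build:

```python
from pyspark.sql import SparkSession

# AWS credentials are assumed to come from the environment
# (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) or from fs.s3a configuration.
spark = (SparkSession.builder
         .appName("sparkify-data-lake")
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
         .getOrCreate())
```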
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here is what a single song JSON file looks like:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from an imaginary music streaming app based on configuration settings.
The destination of the data lake is an AWS S3 bucket holding parquet files. Each of the five tables is written to parquet files in a separate analytics directory on S3, and each table has its own folder within that directory. The `songs` table files are partitioned by year and then artist. The `time` table files are partitioned by year and month. The `songplays` table files are partitioned by year and month.
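For illustration, here is a minimal sketch of how each table could be written out as partitioned parquet. The helper name, output prefix and bucket name are illustrative, not part of the project code:

```python
from pyspark.sql import DataFrame


def write_table(df: DataFrame, output_path: str, name: str, partition_cols=None):
    """Write one analytics table to its own folder as parquet files,
    partitioned by the given columns when provided."""
    writer = df.write.mode("overwrite")
    if partition_cols:
        writer = writer.partitionBy(*partition_cols)
    writer.parquet(f"{output_path}/{name}/")


# Partitioning scheme described above (bucket/prefix are placeholders):
# write_table(songs_table,     "s3a://udacitydends/analytics", "songs",     ["year", "artist_id"])
# write_table(time_table,      "s3a://udacitydends/analytics", "time",      ["year", "month"])
# write_table(songplays_table, "s3a://udacitydends/analytics", "songplays", ["year", "month"])
# write_table(users_table,     "s3a://udacitydends/analytics", "users")
# write_table(artists_table,   "s3a://udacitydends/analytics", "artists")
```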
- songplays - records in event data associated with song plays, i.e. records with page `NextSong`
  - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - users in the app
  - user_id, first_name, last_name, gender, level
- songs - songs in the music database
  - song_id, title, artist_id, year, duration
- artists - artists in the music database
  - artist_id, name, location, latitude, longitude
- time - timestamps of records in songplays broken down into specific units
  - start_time, hour, day, week, month, year, weekday
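Once the tables are in place, they can be queried directly from the parquet files for analytics. A small example, where the paths and the particular query are illustrative only:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Paths are illustrative; point them at your own analytics bucket.
songplays = spark.read.parquet("s3a://udacitydends/analytics/songplays/")
users = spark.read.parquet("s3a://udacitydends/analytics/users/")

# Example: the ten paid users with the most song plays
(songplays.join(users, "user_id")
          .where(users["level"] == "paid")
          .groupBy("user_id", "first_name", "last_name")
          .agg(F.count("*").alias("plays"))
          .orderBy(F.desc("plays"))
          .show(10))
```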
Create an IAM user and grant it the appropriate permissions for reading from and writing to the Amazon S3 resource. Create a bucket on S3 called `udacitydends` (or any name of your choice; bucket names must be globally unique, so make changes accordingly). This bucket will store the processed parquet files of the data lake.
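If you prefer to create the bucket from Python instead of the console, a minimal boto3 sketch (the bucket name and region are placeholders, and your credentials must already be configured):

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Bucket names are globally unique; replace "udacitydends" with your own.
# In us-east-1, omit the CreateBucketConfiguration argument.
s3.create_bucket(
    Bucket="udacitydends",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)
```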
- Read the song data from the `s3a://udacity-dend/song-data/` S3 bucket.
- Using Apache Spark, extract the `songs` and `artists` data. The `songs` table files are partitioned by `year` and then `artist`.
- While inserting the data into `artists` and `songs`, remove duplicate records and sort them by year (see the sketch below).
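A minimal sketch of this song-data step. The glob pattern, bucket name and output prefix are assumptions; adapt them to your layout:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
         .getOrCreate())

output = "s3a://udacitydends/analytics/"                 # your bucket
song_data = "s3a://udacity-dend/song-data/*/*/*/*.json"  # glob pattern is an assumption

df = spark.read.json(song_data)

# songs table: deduplicate on song_id and sort by year before writing
songs_table = (df.select("song_id", "title", "artist_id", "year", "duration")
                 .dropDuplicates(["song_id"])
                 .orderBy("year"))
(songs_table.write.mode("overwrite")
            .partitionBy("year", "artist_id")   # "artist" here is the artist_id column
            .parquet(output + "songs/"))

# artists table: rename the artist_* source columns and deduplicate on artist_id
artists_table = (df.selectExpr("artist_id",
                               "artist_name AS name",
                               "artist_location AS location",
                               "artist_latitude AS latitude",
                               "artist_longitude AS longitude")
                   .dropDuplicates(["artist_id"]))
artists_table.write.mode("overwrite").parquet(output + "artists/")
```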
- Read the log data from the `s3a://udacity-dend/log-data/` S3 bucket.
- Using Apache Spark, extract the data into `users` by ranking all the users in descending order of `ts` and filtering for rank 1.
- Using Apache Spark, extract the data into `time`: convert the existing `ts` column into a `TIMESTAMP` and extract the datetime attributes `hour`, `day`, `week`, `month`, `year` and `weekday`. The `time` table files are partitioned by `year` and `month`.
- Using Apache Spark, extract the data into `songplays` by filtering for rows with `page == NextSong` only, while performing a LEFT JOIN on the `songs` table. The `songplays` table files are partitioned by `year` and `month`. A sketch of these log-data steps follows the list.
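A minimal sketch of the log-data steps, with the same assumptions about bucket names and paths as above. Column renames follow the schema listed earlier, and the join condition is simplified to the song title:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
         .getOrCreate())

output = "s3a://udacitydends/analytics/"                            # your bucket
log_df = spark.read.json("s3a://udacity-dend/log-data/*/*/*.json")  # glob pattern is an assumption

# users table: rank each user's records by ts descending and keep rank 1 (latest level)
w = Window.partitionBy("userId").orderBy(F.desc("ts"))
users_table = (log_df.withColumn("rank", F.row_number().over(w))
                     .where(F.col("rank") == 1)
                     .selectExpr("userId AS user_id", "firstName AS first_name",
                                 "lastName AS last_name", "gender", "level"))
users_table.write.mode("overwrite").parquet(output + "users/")

# keep only song-play events and convert the epoch-millisecond ts column to a timestamp
events = (log_df.where(F.col("page") == "NextSong")
                .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp")))

# time table: datetime attributes derived from start_time
time_table = (events.select("start_time").dropDuplicates()
                    .withColumn("hour", F.hour("start_time"))
                    .withColumn("day", F.dayofmonth("start_time"))
                    .withColumn("week", F.weekofyear("start_time"))
                    .withColumn("month", F.month("start_time"))
                    .withColumn("year", F.year("start_time"))
                    .withColumn("weekday", F.dayofweek("start_time")))
(time_table.write.mode("overwrite")
           .partitionBy("year", "month").parquet(output + "time/"))

# songplays table: LEFT JOIN the NextSong events to the songs table written earlier
songs_df = spark.read.parquet(output + "songs/")
songplays_table = (events.join(songs_df, events.song == songs_df.title, "left")
                         .select(F.monotonically_increasing_id().alias("songplay_id"),
                                 "start_time",
                                 F.col("userId").alias("user_id"),
                                 "level", "song_id", "artist_id",
                                 F.col("sessionId").alias("session_id"),
                                 "location",
                                 F.col("userAgent").alias("user_agent"),
                                 F.year("start_time").alias("year"),
                                 F.month("start_time").alias("month")))
(songplays_table.write.mode("overwrite")
                .partitionBy("year", "month").parquet(output + "songplays/"))
```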
- Make sure your AWS S3 bucket is active and the data lake credentials in `dl.cfg` are accurate (see the sketch below).
- From the shell, run `etl.py`.
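For reference, a minimal sketch of how `etl.py` might load the credentials from `dl.cfg`. The section and key names here are assumptions, so match them to your actual file:

```python
import configparser
import os

# Assumed dl.cfg layout (illustrative):
#   [AWS]
#   AWS_ACCESS_KEY_ID = <your key>
#   AWS_SECRET_ACCESS_KEY = <your secret>
config = configparser.ConfigParser()
config.read("dl.cfg")

# Export the credentials so the s3a connector can pick them up
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```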
Do not forget to clean up your AWS resources.