Sparkify is a growing music streaming startup that aims to move its processes and data onto the cloud. Its data resides in S3, in two directories: JSON logs of user activity in the app, and JSON metadata about the songs in its catalog.
This is an example of what a single song file looks like. Each song JSON file contains information about the song and its artist, such as the title, artist ID, location, artist name, and song duration:
```json
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
```
Each line of a log file carries information about the user (name, gender, location, and account payment level), the artist, the song title, the page visited in the app, and the event time.
This repository contains an ETL pipeline that extracts the data from those JSON files in S3, stages it in Redshift, and transforms it into a set of dimensional tables, so the analytics team can keep finding insights into what songs their users are listening to.
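As a rough sketch of the dimensional model, the example queries at the end of this README reference a `songplays` fact table joined to `songs` and `artists` dimensions. The miniature schema below (SQLite standing in for Redshift, with columns inferred from those queries; the authoritative DDL lives in `sql_queries.py`) shows the join pattern the analytics team uses:

```python
import sqlite3

# Illustrative star-schema sketch. Column lists are inferred from the
# example queries in this README, not copied from sql_queries.py.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE songplays (songplay_id INTEGER PRIMARY KEY,
                        song_id TEXT, artist_id TEXT);
CREATE TABLE songs     (song_id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE artists   (artist_id TEXT PRIMARY KEY, name TEXT);
""")

# One row of sample data, mirroring the song JSON shown above.
cur.execute("INSERT INTO songs VALUES ('SOUPIRU12A6D4FA1E1', 'Der Kleine Dompfaff')")
cur.execute("INSERT INTO artists VALUES ('ARJIE2Y1187B994AB7', 'Line Renaud')")
cur.execute("INSERT INTO songplays VALUES (1, 'SOUPIRU12A6D4FA1E1', 'ARJIE2Y1187B994AB7')")

# The same fact-to-dimension join pattern as the analytics queries below.
cur.execute("""
SELECT s.title, COUNT(*) AS count
FROM songplays sp
JOIN songs s ON s.song_id = sp.song_id
GROUP BY s.title
""")
rows = cur.fetchall()
print(rows)  # [('Der Kleine Dompfaff', 1)]
```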
The data warehouse is based on PySpark (Apache Spark with Python) in order to optimize the ETL process for speed and for flexibility in managing the cloud-based infrastructure. Choosing Redshift gives the Sparkify analytics team high performance in query processing, no hardware-maintenance worries, an easy way to move the data from S3 into Redshift, and the possibility of scaling the cluster up in the future.
- Step 1
  - Create a new [IAM user](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html) in your AWS account.
  - Give it [Administrator Access](https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-started_create-admin-group.html) by attaching the appropriate policies.
  - Use its access key and secret key to create clients for EC2, S3, IAM, and Redshift.
- Step 2
  - See the [IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) docs.
  - Create an IAM role that allows [Redshift](https://aws.amazon.com/pt/redshift/) to access the S3 bucket (read-only).
- Step 3
  - See the [Create Cluster](https://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-launch-sample-cluster.html) docs.
  - Create a Redshift cluster, get the DWH_ENDPOINT (host address) and DWH_ROLE_ARN, and fill in the config file.
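The three steps above can be sketched with boto3 roughly as follows. This is a minimal sketch, not the repo's actual setup code: the role name `dwhRole`, the cluster identifier `dwhCluster`, the node sizing, and the credentials are all illustrative placeholders.

```python
import json

def s3_read_trust_policy():
    """Trust policy that lets Redshift assume the role (Step 2)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

def create_resources(key, secret, region="us-west-2"):
    """Sketch of Steps 1-3; resource names here are illustrative."""
    import boto3  # deferred import: only needed when actually talking to AWS

    # Step 1: clients built from the IAM user's access keys.
    iam = boto3.client("iam", region_name=region,
                       aws_access_key_id=key, aws_secret_access_key=secret)
    redshift = boto3.client("redshift", region_name=region,
                            aws_access_key_id=key, aws_secret_access_key=secret)

    # Step 2: a role Redshift can assume, with read-only access to S3.
    iam.create_role(RoleName="dwhRole",
                    AssumeRolePolicyDocument=json.dumps(s3_read_trust_policy()))
    iam.attach_role_policy(
        RoleName="dwhRole",
        PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess")
    role_arn = iam.get_role(RoleName="dwhRole")["Role"]["Arn"]

    # Step 3: launch the cluster. DWH_ENDPOINT can be read from
    # describe_clusters once the cluster status becomes "available".
    redshift.create_cluster(
        ClusterType="multi-node", NodeType="dc2.large", NumberOfNodes=4,
        ClusterIdentifier="dwhCluster", DBName="dwh",
        MasterUsername="dwhuser", MasterUserPassword="Passw0rd",  # placeholder
        IamRoles=[role_arn])
    return role_arn
```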
- sql_queries.py: contains all the queries used to create the tables, insert data into them, and drop them.
- create_tables.py: contains functions that open a connection to the cluster, drop the tables if they exist, and create them.
- etl.py: gets the data from S3, stages it in Redshift, and inserts it into the final tables.
- dwh.cfg: contains the cluster parameters, the IAM role, and the paths to the JSON files.
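An example of how the config file might be laid out is shown below. The section and key names, the endpoint, the account ID, and the bucket paths are all placeholders; fill in the values from your own cluster and role:

```ini
[CLUSTER]
HOST=dwhcluster.xxxxxxxx.us-west-2.redshift.amazonaws.com
DB_NAME=dwh
DB_USER=dwhuser
DB_PASSWORD=Passw0rd
DB_PORT=5439

[IAM_ROLE]
ARN=arn:aws:iam::123456789012:role/dwhRole

[S3]
LOG_DATA=s3://your-bucket/log_data
LOG_JSONPATH=s3://your-bucket/log_json_path.json
SONG_DATA=s3://your-bucket/song_data
```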
To run the pipeline, execute the scripts in this order (create_tables.py must run first so the tables exist before etl.py loads them):

```sh
python3 create_tables.py
python3 etl.py
```
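Both scripts open a connection to the cluster from the values in the config file; a minimal sketch of that step is below. The `CLUSTER` section and key names are assumptions for illustration, not confirmed against the repo:

```python
from configparser import ConfigParser

def connection_string(config_text):
    """Build a libpq-style DSN from the CLUSTER section of the config.
    Section and key names here are illustrative assumptions."""
    cfg = ConfigParser()
    cfg.read_string(config_text)
    c = cfg["CLUSTER"]
    return ("host={} dbname={} user={} password={} port={}"
            .format(c["HOST"], c["DB_NAME"], c["DB_USER"],
                    c["DB_PASSWORD"], c["DB_PORT"]))

def connect(config_text):
    import psycopg2  # deferred import: only needed to actually connect
    return psycopg2.connect(connection_string(config_text))
```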
```sql
-- Top 10 most played songs
SELECT s.title, COUNT(*) AS count
FROM songplays sp
INNER JOIN songs s ON s.song_id = sp.song_id
GROUP BY s.title
ORDER BY count DESC, s.title ASC
LIMIT 10;
```
```sql
-- Most played songs per artist
SELECT ar.name, s.title, COUNT(*) AS count
FROM songplays sp
INNER JOIN songs s ON s.song_id = sp.song_id
INNER JOIN artists ar ON ar.artist_id = sp.artist_id
GROUP BY ar.name, s.title
ORDER BY count DESC, ar.name ASC, s.title ASC;
```