Spotify's Rap Caviar playlist ETL pipeline using AWS (...& Databricks)

Using an ETL pipeline to investigate the change in hip-hop/rap genre over time

(Remind me to check on this repo in a year so I can provide a fancy graph).

Table of Contents

  • Introduction
  • Installation
  • Triggers
  • Databricks
  • References

Introduction

Purpose

This project investigates changes in the hip-hop/rap genre over time, based on statistical values (i.e. mean, standard deviation, max and min) of the audio features of the tracks in Spotify's 'Rap Caviar' playlist, using an ETL process on Amazon Web Services (AWS) and Databricks.

Background

Rap Caviar is one of the most popular playlists on Spotify. It consists of songs mainly from the hip-hop/rap genre. One could argue that this playlist is a good representation of the hip-hop/rap genre. As with any music genre, it changes and develops over time based on various factors such as culture, technology, influences of other genres, etc.

When a song is uploaded to Spotify, Spotify assigns various audio features to it, which are used for automated song recommendations. Here, the playlist is summarised by statistical values of the audio features of the tracks it contains. This is achieved with an ETL pipeline built on AWS and Databricks, which extracts and transforms this data (audio features) from Spotify's API every week. After a significant number of weeks have passed, one will be able to see any changes within the genre.

Technologies

  • AWS Lambda
  • AWS S3
  • AWS Identity and Access Management (IAM)
  • AWS CloudWatch
  • Databricks
  • Python
  • PySpark

ETL Pipeline

(Diagram: ETL pipeline architecture)

Data Model

(Diagram: data model)

(Return to Table of Contents)

Installation

Spotify API

First, you need to create an account with Spotify for Developers. Thereafter, create an app to obtain your account's Client ID and Client Secret.

Identity and Access Management (IAM)

An AWS Access Key and Secret Key are required for Databricks to read files within the S3 bucket in your AWS account. Create a user under the 'Access management - Users' tab. Once the user has been created, download the credential information of your account (in .csv format).

Configuration to use when creating the user:

  • AWS type: Programmatic access
  • Permissions: 'AmazonS3FullAccess'

Lambda Functions

Both Lambda functions extract data from the Spotify API. Both Lambda function scripts can be found in the repo. Upload each to its respective Lambda function environment in the ZIP format provided; the ZIP files contain the code as well as the required Python packages. Details regarding each function are as follows:

Lambda function 1

  • Name: Spotify_playlist_items_function
  • Function: Extracts song information from the 'Rap Caviar' playlist (refer to data model)
  • Trigger: CloudWatch
  • Runtime: Python 3.9
  • Architecture: x86_64
  • Timeout: 1 min (previously 3 s, the default)

Lambda function 2

  • Name: Spotify_audio_features_function
  • Function: Extracts audio features from a song list. In this case from the 'Rap Caviar' playlist (refer to data model)
  • Trigger: S3 Bucket (Event Type: PUT)
  • Runtime: Python 3.9
  • Architecture: x86_64
  • Timeout: 1 min (previously 3 s, the default)

The main idea of both Lambda functions is that each extracts data from the Spotify API, stores that data in a CSV file, and uploads it to (or reads it from) an S3 bucket. The spotipy package is used to access the Spotify API. Place your Client ID and Client Secret in the provided fields.

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

CLIENT_ID = "your ID here"
CLIENT_SECRET = "your ID here"

# Authenticate against the Spotify Web API using the client credentials flow
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET))
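
For orientation, below is a minimal, hypothetical sketch of what the first function does end to end: extract the playlist items, write them to a CSV and upload it to S3. The playlist ID, bucket name, object key and CSV columns are assumptions made for illustration; the actual scripts in the repo are the source of truth.

import csv
import io
from datetime import date

import boto3
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

CLIENT_ID = "your ID here"
CLIENT_SECRET = "your ID here"
PLAYLIST_ID = "37i9dQZF1DX0XUsuxWHRQd"   # assumed Rap Caviar playlist ID
BUCKET = "your-s3-bucket-name"           # assumed bucket name

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id=CLIENT_ID, client_secret=CLIENT_SECRET))


def lambda_handler(event, context):
    # Extract: pull the current playlist items from the Spotify API
    items = sp.playlist_items(PLAYLIST_ID)["items"]

    # Transform: keep only the fields needed downstream (columns are illustrative)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["track_id", "track_name", "artist", "added_at"])
    for item in items:
        track = item["track"]
        writer.writerow([track["id"], track["name"],
                         track["artists"][0]["name"], item["added_at"]])

    # Load: upload the CSV to S3; the PUT then triggers the second function
    key = f"playlist_items/rap_caviar_{date.today()}.csv"
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue())
    return {"statusCode": 200, "body": key}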

AWS Lambda does not provide all the Python packages needed to run the code, so the required packages had to be placed in the environment as well as in a layer. The following Python packages were required:

  • redis
  • spotipy
  • requests

Two layers need to be attached to both Lambda functions: a layer consisting of the relevant Python packages, and the AWS-provided layer 'AWSDataWrangler-Python39', which already exists. When creating the Python package layer, use the 'python packages.zip' file in the repo.

(Return to Table of Contents)

Triggers

CloudWatch (EventBridge)

A rule is required to trigger the first Lambda function, 'Spotify_playlist_items_function', every week (7 days), for example with a schedule expression of rate(7 days). Refer below for the rule configuration.

(Image: EventBridge rule configuration)

S3 Bucket

As shown in the ETL pipeline, the second Lambda function, 'Spotify_audio_features_function', takes part of the output (the CSV file) of the first Lambda function, 'Spotify_playlist_items_function', and uses it to retrieve the audio features of all the songs in the playlist. The S3 bucket can therefore be used to trigger the second Lambda function; note that it is a PUT event type. A sketch of how the second function might consume that event follows below.
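
The sketch below is hypothetical: the CSV column name ('track_id') and the output key are assumptions carried over from the earlier sketch, not the repo's actual layout.

import csv
import io

import boto3
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

s3 = boto3.client("s3")
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="your ID here", client_secret="your ID here"))


def lambda_handler(event, context):
    # The S3 PUT event carries the bucket and key of the CSV written by the first function
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Read the track IDs out of that CSV
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    track_ids = [row["track_id"] for row in csv.DictReader(io.StringIO(body))]

    # The audio-features endpoint accepts at most 100 track IDs per request
    features = []
    for i in range(0, len(track_ids), 100):
        features.extend(f for f in sp.audio_features(track_ids[i:i + 100]) if f)

    # Write the audio features back to S3 as a second CSV (key is illustrative)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=sorted(features[0].keys()))
    writer.writeheader()
    writer.writerows(features)
    s3.put_object(Bucket=bucket,
                  Key=key.replace("playlist_items", "audio_features"),
                  Body=out.getvalue())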

(Return to Table of Contents)

Databricks

Databricks Community Edition, the free version provided by Databricks, was used for this project. As mentioned previously, the AWS account's credential information file (containing the access and secret keys) was retrieved from AWS IAM and is used to mount the S3 bucket onto the Databricks cluster so files can be read. Create your cluster; thereafter, the AWS credential information file can be uploaded to Databricks. Click the Data icon, then the Create Table button, and drag or upload your credential information file onto Databricks. Note the DBFS Target Directory: this will be the location of the credential information file.

The code was written using PySpark. Create a notebook and copy the code from 'Databricks_ETL_Pyspark.py' in the repo. For Databricks to gain access to the data, the S3 bucket has to be mounted on the cluster; this only needs to be done once per cluster.

dbutils.fs.mount(SOURCE_URL, MOUNT_NAME)
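
As a hedged sketch of how SOURCE_URL and MOUNT_NAME could be built (the DBFS path, bucket name and mount point below are placeholders, and spark/dbutils are the globals available inside a Databricks notebook):

from urllib.parse import quote

CREDENTIALS_FILE = "/FileStore/tables/credentials.csv"   # assumed DBFS target directory
BUCKET = "your-s3-bucket-name"                            # assumed bucket name
MOUNT_NAME = "/mnt/spotify_etl"                           # assumed mount point

# Read the access and secret keys from the uploaded IAM credentials CSV
creds = spark.read.csv(CREDENTIALS_FILE, header=True).first()
access_key = creds["Access key ID"]
secret_key = quote(creds["Secret access key"], safe="")   # URL-encode the secret key

# Mount the bucket; this only needs to run once per cluster
SOURCE_URL = f"s3a://{access_key}:{secret_key}@{BUCKET}"
dbutils.fs.mount(SOURCE_URL, MOUNT_NAME)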

The main idea of this code is to mount the S3 bucket onto the Databricks cluster (extract), apply basic statistical aggregate functions to each audio feature (transform), and write each result out as a CSV file (load); a sketch of the aggregation step is given after the image. The Databricks part of the ETL process is not automated, possibly due to the limitations of the free subscription, so the CSV file for each audio feature has to be downloaded manually. Refer to the image for details.

(Image: Databricks output details)
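
A minimal sketch of the transform-and-load step for a single audio feature, assuming the weekly audio-feature CSVs live under the mounted bucket and carry a 'week' column (the paths, column names and rounding are placeholders, not the repo's exact code):

from pyspark.sql import functions as F

# Extract: read the weekly audio-feature CSVs from the mounted bucket
df = spark.read.csv("/mnt/spotify_etl/audio_features/", header=True, inferSchema=True)

# Transform: per-week mean, std, max and min for one audio feature
feature = "danceability"
stats = (df.groupBy("week")
           .agg(F.round(F.mean(feature), 3).alias("mean"),
                F.round(F.stddev(feature), 3).alias("std"),
                F.max(feature).alias("max"),
                F.min(feature).alias("min"))
           .orderBy("week"))

# Load: coalesce to one partition so Spark writes a single CSV file for this feature
(stats.coalesce(1)
      .write.mode("overwrite")
      .option("header", True)
      .csv(f"/mnt/spotify_etl/output/{feature}"))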

(Return to Table of Contents)

References

write csv file into one file by pyspark
https://stackoverflow.com/questions/36574617/how-to-write-csv-file-into-one-file-by-pyspark

Rename a PySpark dataframe column by index
https://www.geeksforgeeks.org/how-to-rename-a-pyspark-dataframe-column-by-index/

Musicstax
https://musicstax.com/

Spotify Popularity — A unique insight into the Spotify algorithm and how to influence it
https://lab.songstats.com/spotify-popularity-a-unique-insight-into-the-spotify-algorithm-and-how-to-influence-it-93bb63863ff0

AWS get files without getting folders
https://stackoverflow.com/questions/42673764/boto3-s3-get-files-without-getting-folders

Mount S3 to Databricks
https://www.youtube.com/watch?v=jKUBwgIcK7g

PySpark groupby mean function
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.groupby.GroupBy.mean.html

Pyspark round function
https://www.educba.com/pyspark-round/

Download Data From Databricks (DBFS) to Local System
https://www.youtube.com/watch?v=PdLpXhK4u8w

Databricks Mount To AWS S3 And Import Data
https://www.youtube.com/watch?v=jKUBwgIcK7g&list=LL&index=19

PySpark single .csv file
https://stackoverflow.com/questions/65954797/how-to-save-pyspark-data-frame-in-a-single-csv-file

STARTED, COMMITTED, and SUCCESS files in Spark
https://stackoverflow.com/questions/68196969/what-are-the-started-committed-and-success-files-in-a-spark-parquet-tab

Multiple criteria for aggregation on PySpark Dataframe
https://www.geeksforgeeks.org/multiple-criteria-for-aggregation-on-pyspark-dataframe/

Read CSV files - Pyspark
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html#pyspark.sql.DataFrameReader.csv

Pyspark change date format
https://stackoverflow.com/questions/70856553/changing-date-format-in-pyspark

Dataframe to .csv
https://www.golinuxcloud.com/convert-pandas-dataframe-to-csv/

AWS Lambda "errorMessage": Task timed out after 3.00 seconds
https://stackoverflow.com/questions/62948910/aws-lambda-errormessage-task-timed-out-after-3-00-seconds

Github - Data Lake on AWS ingested by Spotify APIs
https://github.com/abhinavjainn/spotify-aws-data-lake

Create a CSV in Lambda using Python?
https://newbedev.com/how-do-i-create-a-csv-in-lambda-using-python

Writing CSV Files with csv.writer and DictWriter
https://www.youtube.com/watch?v=jnkPnNaLY3g

Upload to S3 From Lambda
https://www.youtube.com/watch?v=vXiZO1c5Sk0

Create CSV in AWS Lambda
https://stackoverflow.com/questions/57429183/how-do-i-create-a-csv-in-lambda-using-python

(Return to Table of Contents)