musicaly

An end-to-end data pipeline that ingests simulated music streaming data, structures, cleans, and models the raw data, and performs analytics on the cleaned data.

background

Eventsim is a top music streaming company. Its management is working on a new feature tailored to users' preferences. To support the development of this feature, the developers need to understand users' streaming habits, so they came up with the following use cases and questions.

What is the total number of active users, their total stream hours, and their geographic distribution?
What is the general gender composition of users and how do they make up the top artists?
What are the top songs and who are the top artists that users listen to?
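As a rough illustration of what these questions ask of the data, here is a minimal Python sketch that answers them over a handful of in-memory listen events. The field names (`userId`, `gender`, `state`, `artist`, `song`, `duration` in seconds) and the sample records are assumptions made for the example, not taken from this repo. In the project itself these answers come from the warehouse models and the dashboard.

```python
from collections import Counter

# Hypothetical listen events; field names and values are assumed for illustration.
events = [
    {"userId": 1, "gender": "F", "state": "CA", "artist": "Artist A", "song": "Song X", "duration": 1800.0},
    {"userId": 1, "gender": "F", "state": "CA", "artist": "Artist B", "song": "Song Y", "duration": 1800.0},
    {"userId": 2, "gender": "M", "state": "NY", "artist": "Artist A", "song": "Song X", "duration": 3600.0},
]


def summarize(events):
    """Answer the use-case questions over a list of listen events."""
    # Keep one record per distinct user so users are not counted once per play.
    users = {e["userId"]: e for e in events}
    return {
        "active_users": len(users),
        # Durations are assumed to be in seconds; convert the total to hours.
        "stream_hours": sum(e["duration"] for e in events) / 3600,
        "users_by_state": Counter(u["state"] for u in users.values()),
        "users_by_gender": Counter(u["gender"] for u in users.values()),
        # Play counts per (artist, gender) show who makes up each artist's audience.
        "artist_plays_by_gender": Counter((e["artist"], e["gender"]) for e in events),
        "top_songs": Counter(e["song"] for e in events).most_common(3),
        "top_artists": Counter(e["artist"] for e in events).most_common(3),
    }


summary = summarize(events)
print(summary)
```

Users are deduplicated before the gender and location counts so that heavy listeners do not skew the composition, while song and artist rankings are counted per play.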

data flow

The Eventsim API produces the streaming data, which is published to Kafka.
Spark Streaming reads the stream data from Kafka.
Spark Streaming structures the data and writes it to the data lake (Cloud Storage) as flat files.
ELT moves the data from the data lake (Cloud Storage) to the data warehouse (BigQuery) using dbt, orchestrated with Airflow.
Stream analytics are built and published with Google Data Studio.
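To make the "structures the data" step more concrete, the sketch below shows the kind of transformation involved: parse a raw JSON message, cast each field to a fixed type, and serialize the result as a flat row. It is plain Python rather than Spark, and the schema, the millisecond `ts` field, the CSV output, and the function names `structure_event` and `to_flat_file` are all assumptions for illustration. The project's actual Spark Streaming job and file format may differ.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Assumed subset of event fields and their target types (hypothetical schema).
SCHEMA = {"userId": int, "artist": str, "song": str, "duration": float, "ts": int}


def structure_event(raw):
    """Parse one raw JSON message and cast each field to its schema type."""
    record = json.loads(raw)
    row = {name: cast(record[name]) for name, cast in SCHEMA.items()}
    # Assumes `ts` is epoch milliseconds; convert to an ISO-8601 UTC timestamp.
    row["ts"] = datetime.fromtimestamp(row["ts"] / 1000, tz=timezone.utc).isoformat()
    return row


def to_flat_file(rows):
    """Serialize structured rows as CSV text, standing in for the flat file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(SCHEMA), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# A made-up message shaped like what a Kafka consumer might receive.
message = b'{"userId": "7", "artist": "Artist A", "song": "Song X", "duration": "215.5", "ts": 1600000000000}'
rows = [structure_event(message)]
print(to_flat_file(rows))
```

Casting against an explicit schema up front means malformed messages fail at the structuring step instead of surfacing later in the warehouse models.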

cloud architecture

data source

Eventsim is a program that generates event data to replicate page requests for a fake music website. The results look like real usage data but are entirely fake. The Docker image is borrowed from viirya's fork, as the original project has gone unmaintained for a few years.

Eventsim uses song data from the Million Song Dataset to generate events. I have used a subset of 10,000 songs.

dashboard

Click here to view the latest version on Data Studio

how to set up

⚠️ Note that GCP resources (which incur cost) are provisioned in this project

⚠️ Also this setup assumes you are using a linux or bash environment

clone this repo to the ~/musicaly-project directory

git clone https://github.com/topefolorunso/musicaly-project.git ~/musicaly-project && \
cd ~/musicaly-project

set up a GCP account
provision infrastructure
ssh into and set up the VMs
proceed to run

how to run

start up the Kafka service and start streaming here
start up the Spark Streaming service here
start up the Airflow service here
connect BigQuery to Data Studio for analytics