A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analysis team is particularly interested in understanding what songs users are listening to. Currently, there is no easy way to query the data to generate the results, since the data reside in a directory of CSV files on user activity on the app.
They'd like a data engineer to create a Apache Cassandra database with tables designed to optimize queries on song play analysis. The data engineer will create a database schema and ETL pipeline for this analysis. The database and ETL pipeline will be tested by running queries provided by the analytics team from Sparkify and compare the outcome with expected results.
The tables are created to answer the queries outlined in the project:
1 - Get the artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4
2 - Get only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182
3 - Get every user name (first and last) in music app history who listened to the song 'All Hands Against His Own'
song_plays_by_session - store song play activities by session
COLUMN | TYPE | NOTE |
---|---|---|
sessionId | INT | Primary Key |
itemInSession | INT | |
artist | TEXT | |
song | TEXT | |
length | DECIMAL |
song_plays_by_user - store song play activities by user
COLUMN | TYPE | NOTE |
---|---|---|
userId | INT | Primary Key |
sessionId | INT | |
itemInSession | INT | |
artist | TEXT | |
song | TEXT | |
firstName | TEXT | |
lastName | TEXT |
users_by_song - store user info by song title
COLUMN | TYPE | NOTE |
---|---|---|
userId | INT | Primary Key |
song | TEXT | |
firstName | TEXT | |
lastName | TEXT |
ETL pipeline is written in Python, which extract data from static text files, transform data to clean/proper data format, then load the data into related tables in the database.
The text files are in CSV format and contain data about songs, users, song play sessions/activities. The files are located in "event_data" directory.
CassandraDataModeling.ipynb - Jupyper notebook contains methods to extract data from CSV, create database, create tables, and insert data into database.
Th following must be installed and setup on local machine before running the scripts/files in this project:
- Appache Cassandra
- Python 2.7+
- Python Packages - Jupyter Notebook, Pandas, Cassandra Driver
Run the following commands in terminal:
jupyter notebook CassandraDataModeling.ipynb