Udacity Data Engineering Nanodegree - Project 2/6

Data Modeling with Cassandra

Introduction

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analysis team is particularly interested in understanding what songs users are listening to. Currently, there is no easy way to query the data to generate the results, since the data reside in a directory of CSV files on user activity on the app.

They therefore need a data engineer to create an Apache Cassandra database against which queries on song play data can be run to answer these questions.

Project Description

The project covers the following steps:

Model an Apache Cassandra database
Create tables that are aligned with the queries
Insert the relevant data into the tables
Check that the SELECT statements return the correct results
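
As a rough illustration of these steps, here is a minimal sketch of the query-first approach: start from one question, design a table whose primary key matches the WHERE clause of that question, insert rows, then run the SELECT. The question, the table name song_by_session, and all column names are assumptions made for this example, not the project's actual schema. The session argument is assumed to be a cassandra-driver Session (anything with an execute method works).

```python
# Hypothetical question: "Which song was played at a given position
# within a given listening session?"
# Table and column names are illustrative assumptions.

# The primary key mirrors the WHERE clause of the query below:
# session_id is the partition key, item_in_session the clustering column.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS song_by_session (
    session_id int,
    item_in_session int,
    artist text,
    song text,
    PRIMARY KEY (session_id, item_in_session)
)
"""

# cassandra-driver uses %s placeholders for simple (non-prepared) statements.
INSERT_ROW = """
INSERT INTO song_by_session (session_id, item_in_session, artist, song)
VALUES (%s, %s, %s, %s)
"""

SELECT_SONG = """
SELECT artist, song FROM song_by_session
WHERE session_id = %s AND item_in_session = %s
"""


def load_and_query(session, rows, session_id, item_in_session):
    """Create the table, insert the rows, and run the SELECT.

    rows is an iterable of (session_id, item_in_session, artist, song) tuples.
    """
    session.execute(CREATE_TABLE)
    for row in rows:
        session.execute(INSERT_ROW, row)
    return list(session.execute(SELECT_SONG, (session_id, item_in_session)))
```

Because Cassandra does not support joins, each question typically gets its own table designed this way, even if that means storing the same data more than once.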

Requirements

This project was developed on Linux (Ubuntu 20.04 LTS) using the source-code editor Visual Studio Code.

To implement the project you will need the following things:

Python
Apache Cassandra
- Here you can find a tutorial for installing Apache Cassandra on Ubuntu
Jupyter

To use Apache Cassandra from Python, install the following package:

  pip install cassandra-driver
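
Once the driver is installed, connecting from Python could look roughly like the sketch below. It assumes a single Cassandra node running locally on 127.0.0.1 and a keyspace named sparkify; both values, as well as the helper names keyspace_cql and connect, are assumptions made for illustration.

```python
def keyspace_cql(name, replication_factor=1):
    """Build a CREATE KEYSPACE statement suited to a single-node dev setup."""
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        "WITH REPLICATION = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}}"
    )


def connect(host="127.0.0.1", keyspace="sparkify"):
    """Connect to a Cassandra node and switch to the given keyspace.

    The host and keyspace defaults are assumptions for a local setup.
    """
    # Imported inside the function so this module still loads
    # when cassandra-driver is not installed.
    from cassandra.cluster import Cluster

    cluster = Cluster([host])
    session = cluster.connect()
    session.execute(keyspace_cql(keyspace))  # create the keyspace if missing
    session.set_keyspace(keyspace)           # make it the default for queries
    return cluster, session
```

A typical use would be cluster, session = connect(), followed by cluster.shutdown() once all queries have run.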

Project Datasets

For this project, we'll be working with one dataset: event_data, a directory of CSV files partitioned by date. Here are examples of filepaths to two files in the dataset: