Udacity Data Engineering Nanodegree - Project 2/6
Data Modeling with Cassandra
Introduction
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analysis team is particularly interested in understanding what songs users are listening to. Currently, there is no easy way to query the data to generate the results, since the data reside in a directory of CSV files on user activity on the app.
The team therefore needs a data engineer to create an Apache Cassandra database that can answer these questions with queries on song play data.
Project Description
This project covers the following steps:
- Model an Apache Cassandra database
- Create tables that are aligned to the queries
- Insert the relevant data into the tables
- Check that the SELECT statements return the expected results
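The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual schema: the table name `song_by_session`, its columns, the keyspace name `sparkify`, and the helper `create_table_cql` are all hypothetical. The point is that in Cassandra the primary key is chosen to match the WHERE clause of the query the table serves.

```python
def create_table_cql(table, columns, partition_key, clustering_keys=()):
    """Build a CREATE TABLE statement whose primary key mirrors the query's WHERE clause."""
    cols = ", ".join(f"{name} {cql_type}" for name, cql_type in columns.items())
    pk = "(" + ", ".join(partition_key) + ")"
    if clustering_keys:
        pk += ", " + ", ".join(clustering_keys)
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols}, PRIMARY KEY ({pk}))"


# Hypothetical table and columns, chosen only for illustration.
TABLE = "song_by_session"
COLUMNS = {
    "session_id": "int",
    "item_in_session": "int",
    "artist": "text",
    "song": "text",
}

column_list = ", ".join(COLUMNS)
placeholders = ", ".join(["%s"] * len(COLUMNS))

CREATE_CQL = create_table_cql(TABLE, COLUMNS, ["session_id"], ["item_in_session"])
INSERT_CQL = f"INSERT INTO {TABLE} ({column_list}) VALUES ({placeholders})"
SELECT_CQL = (
    f"SELECT artist, song FROM {TABLE} "
    "WHERE session_id = %s AND item_in_session = %s"
)


def run_against_cluster(host="127.0.0.1", keyspace="sparkify"):
    """Create, insert, and select against a local node (assumes Cassandra is running)."""
    # Imported here so the statements above can be inspected without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster([host])
    session = cluster.connect()
    session.execute(
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        "WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}"
    )
    session.set_keyspace(keyspace)
    session.execute(CREATE_CQL)
    # Placeholder row, not taken from the dataset.
    session.execute(INSERT_CQL, (1, 0, "Example Artist", "Example Song"))
    for row in session.execute(SELECT_CQL, (1, 0)):
        print(row.artist, row.song)
    cluster.shutdown()
```

Calling `run_against_cluster()` requires a reachable Cassandra node; the statement strings themselves can be printed and reviewed without one.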
Requirements
This project was developed on Linux (Ubuntu 20.04 LTS) with Visual Studio Code as the source-code editor.
To run the project, you will need the following:
- Python
- Apache Cassandra
- Jupyter
To work with Apache Cassandra from Python, install the following module:
pip install cassandra-driver
Project Datasets
For this project, we'll be working with one dataset: event_data, a directory of CSV files partitioned by date. Here are examples of filepaths to two files in the dataset:
event_data/2018-11-08-events.csv
event_data/2018-11-09-events.csv
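Because the data is spread over one file per day, a first step is usually to collect all file paths and read them as one stream of rows. The sketch below shows one way to do that with the standard library. The function names `list_event_files` and `read_events` are hypothetical, and the code assumes each CSV file starts with a header row.

```python
import csv
import glob
import os


def list_event_files(root="event_data"):
    """Return the paths of all daily event files under root, sorted by filename (and so by date)."""
    return sorted(glob.glob(os.path.join(root, "*-events.csv")))


def read_events(paths):
    """Yield every row of every file as a dict keyed by the header row (assumed present)."""
    for path in paths:
        with open(path, newline="", encoding="utf8") as f:
            yield from csv.DictReader(f)


if __name__ == "__main__":
    files = list_event_files()
    print(f"Found {len(files)} event files")
```

Sorting the paths keeps the rows in date order, since the filenames begin with an ISO-style date.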