
Udacity Data Engineering Nanodegree Project 2 of 6 - Data Modeling with Cassandra


Udacity Data Engineering Nanodegree - Project 2/6


Data Modeling with Cassandra


Introduction

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analysis team is particularly interested in understanding what songs users are listening to. Currently, there is no easy way to query the data to generate the results, since the data reside in a directory of CSV files on user activity on the app.

The task for the data engineer is therefore to create an Apache Cassandra database that can run queries on the song play data and answer the analysis team's questions.

Project Description

This project covers the following steps:

  • Model an Apache Cassandra database
  • Create tables that are aligned to the queries
  • Insert the relevant data into the tables
  • Check that the SELECT statements return the expected results
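
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the keyspace name `sparkify`, the table `songs_by_session`, and its columns are hypothetical, chosen only to show a table whose primary key matches the WHERE clause of one query. The `session` argument is assumed to be a `cassandra-driver` session (or any object with the same `execute` and `set_keyspace` methods).

```python
# Hypothetical names for illustration; the project's real tables may differ.
KEYSPACE = "sparkify"
TABLE = "songs_by_session"

CREATE_KEYSPACE = (
    f"CREATE KEYSPACE IF NOT EXISTS {KEYSPACE} "
    "WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)

# The primary key mirrors the WHERE clause of the query the table serves.
CREATE_TABLE = (
    f"CREATE TABLE IF NOT EXISTS {TABLE} ("
    "session_id int, item_in_session int, artist text, song text, "
    "PRIMARY KEY (session_id, item_in_session))"
)

INSERT = (
    f"INSERT INTO {TABLE} (session_id, item_in_session, artist, song) "
    "VALUES (%s, %s, %s, %s)"
)

SELECT = (
    f"SELECT artist, song FROM {TABLE} "
    "WHERE session_id = %s AND item_in_session = %s"
)


def run_pipeline(session, rows):
    """Create the keyspace and table, insert rows, and query the first row back."""
    session.execute(CREATE_KEYSPACE)
    session.set_keyspace(KEYSPACE)
    session.execute(CREATE_TABLE)
    for row in rows:
        session.execute(INSERT, row)
    first = rows[0]
    # Check the SELECT by reading back the first inserted row.
    return list(session.execute(SELECT, (first[0], first[1])))


# Usage against a local node (assumes Cassandra is listening on 127.0.0.1):
#   from cassandra.cluster import Cluster
#   cluster = Cluster(["127.0.0.1"])
#   session = cluster.connect()
#   print(run_pipeline(session, [(338, 4, "Artist A", "Song A")]))
#   cluster.shutdown()
```

Because Cassandra tables are designed per query, a second question with a different WHERE clause would normally get its own table rather than a new index on this one.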

Requirements

This project was developed on Linux (Ubuntu 20.04 LTS) using Visual Studio Code as the source-code editor.

To work with Apache Cassandra from Python, install the following module:

  pip install cassandra-driver

Project Datasets

For this project, we'll be working with one dataset: event_data. It is a directory of CSV files partitioned by date. Here are examples of file paths to two files in the dataset:

event_data/2018-11-08-events.csv
event_data/2018-11-09-events.csv
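
One possible way to gather these files into a single list of rows is sketched below. It assumes that every file has a header row and that file names end in `-events.csv`, as in the two examples; the function names `list_event_files` and `load_events` are made up for this illustration.

```python
import csv
import glob
import os


def list_event_files(directory="event_data"):
    """Return the event CSV paths in `directory`, sorted by file name (and so by date)."""
    return sorted(glob.glob(os.path.join(directory, "*-events.csv")))


def load_events(directory="event_data"):
    """Read every event file and return all rows as dictionaries keyed by column name."""
    rows = []
    for path in list_event_files(directory):
        # newline="" lets the csv module handle line endings itself.
        with open(path, newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return rows
```

Calling `load_events()` from the project root would then give one combined list that can be filtered and inserted into the Cassandra tables.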