A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analysis team is particularly interested in understanding what songs users are listening to. Currently, there is no easy way to query the data to generate the results, since the data reside in a directory of CSV files on user activity on the app.
The goal of this project is to create an Apache Cassandra database which can create queries on song play data to answer the questions of interest.
Generally speaking, in this project, we model the data by creating tables in Apache Cassandra to run queries. We are provided with part of the ETL pipeline that transfers data from a set of CSV files within a directory to create a streamlined CSV file to model and insert data into Apache Cassandra tables.
In this project, we'll be working with one dataset: event_data
. The directory of CSV files partitioned by date. Here are examples of filepaths to two files in the dataset:
event_data/2018-11-08-events.csv
event_data/2018-11-09-events.csv
event_data/2018-11-10-events.csv
event_data/2018-11-11-events.csv
event_data/2018-11-12-events.csv
...
There are 30 csv files in total in the event_data
file.
The whole dataset has been already uploaded to Dropbox.
-
Design tables to answer the queries outlined in the notebook.
-
Write Apache Cassandra
CREATE KEYSPACE
andSET KEYSPACE
statements. -
Develop the
CREATE
statement for each of the tables to address each question. -
Load the data with
INSERT
statement for each of the tables. -
Include
IF NOT EXISTS
clauses in theCREATE
statements to create tables only if the tables do not already exist. IncludeDROP TABLE
statement for each table, this way we can run drop and create tables whenever we want to reset the database and test the ETL pipeline. -
Test by running the proper select statements with the correct
WHERE
clause.
-
Implement the logic in the notebook to iterate through each event file in
event_data
to process and create a new CSV file in Python. -
Implement the Apache Cassandra
CREATE
andINSERT
statements to load processed records into relevant tables in the data model. -
Test by running
SELECT
statements after running the queries on your database.
There are still some of improvements that can be made in near future. For example,
- When printng our the query results, we can define a function to enclose this process.
- As this project was implemented in the classroom virtual envirnment, it is highly necessary to be tested in our local machine for better unstanding the whole process such as env setup, debugging and other playarounds.