Data Engineer Nanodegree - Project 4 - Data Lake with Spark

Project 4: Data Lake


Project Summary

Sparkify is a music streaming startup that has grown rapidly over the past few months, and its services are now known worldwide.

The customer database has become huge, bringing new challenges in delivering diverse data to business analysts in a timely manner. In addition, new roles, such as data scientists, will be working with that data.

The purpose of this project is to move the current Data Warehouse into the Big Data world by building a data lake with Spark.

Data Lake

A data lake provides the ability to handle both structured and unstructured data. It lets data analysts perform fast, ad-hoc data exploration, and it supports new types of analytics such as machine learning and natural language processing.

Finally, a data lake shares the same goal as a conventional Data Warehouse, supporting business insights, making it the data engineering answer to Sparkify's new data challenges.

In the context of a data lake, dimensional modeling continues to be a valuable practice.

Data Sources

Data resides in two directories that contain files in JSON format:

  1. s3a://udacity-dend/song_data : Contains metadata about a song and the artist of that song;
  2. s3a://udacity-dend/log_data : Consists of log files generated by the streaming app based on the songs in the dataset above;
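
The snippet below is a minimal PySpark sketch of how these sources can be read. The Spark session configuration, the hadoop-aws package version, and the wildcard depth of the paths are assumptions for illustration, not project requirements.

```python
from pyspark.sql import SparkSession

# Build a Spark session; the hadoop-aws package makes s3a:// paths resolvable
# (the package version is an assumption and may need to match your Spark build).
spark = (
    SparkSession.builder
    .appName("sparkify-data-lake")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
    .getOrCreate()
)

# Song metadata sits in nested prefixes, so a wildcard glob is assumed here
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")

# Log events are assumed to be organized in year/month folders
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

song_df.printSchema()
log_df.printSchema()
```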

Data Quality Checks

Analytics are best performed when data follows quality standards, so the following data quality actions were taken in this project:

  1. Blank spaces and zeros were replaced with null;
  2. Duplicates were removed from the dimension tables. In particular, for the users table, the record from each user's most recent interaction is kept as the definitive version in users_table, ensuring it reflects the latest level status of every user (see the sketch after this list).
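
Below is a hedged PySpark sketch of these two quality checks. The column names (userId, ts, firstName, lastName, gender, level) follow the standard Sparkify event-log schema and are assumptions about the actual files, not a copy of the project's code.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1. Replace empty strings and zeros with null (shown for the userId column;
#    the same pattern would apply to other affected columns).
clean_df = log_df.withColumn(
    "userId",
    F.when(F.col("userId").isin("", "0"), F.lit(None)).otherwise(F.col("userId")),
)

# 2. Deduplicate users: keep only the row from each user's most recent
#    interaction so users_table carries the latest level status.
latest_event = Window.partitionBy("userId").orderBy(F.col("ts").desc())

users_table = (
    clean_df
    .filter(F.col("userId").isNotNull())
    .withColumn("row_num", F.row_number().over(latest_event))
    .filter(F.col("row_num") == 1)
    .select("userId", "firstName", "lastName", "gender", "level")
)
```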

Scripts Usage

  1. etl.py: Orchestrates the entire data pipeline. It extracts the JSON source files from S3, loads them into schema-on-read tables, transforms the data while applying the quality checks above, and finally writes five separate tables back to S3 (a structural sketch follows below).
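
As a rough illustration of that flow, the sketch below outlines one possible structure for etl.py. The function names, output bucket, column selections, and parquet partitioning are illustrative assumptions rather than the actual implementation.

```python
# etl.py -- illustrative structure only; names, columns, and paths are assumptions
from pyspark.sql import SparkSession


def create_spark_session():
    """Create (or reuse) a Spark session able to read from and write to S3."""
    return (
        SparkSession.builder
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
        .getOrCreate()
    )


def process_song_data(spark, input_data, output_data):
    """Extract song JSON files and write the songs and artists dimension tables."""
    song_df = spark.read.json(input_data + "song_data/*/*/*/*.json")

    songs_table = (
        song_df.select("song_id", "title", "artist_id", "year", "duration")
        .dropDuplicates(["song_id"])
    )
    songs_table.write.mode("overwrite").partitionBy("year", "artist_id").parquet(output_data + "songs/")

    artists_table = (
        song_df.select("artist_id", "artist_name", "artist_location",
                       "artist_latitude", "artist_longitude")
        .dropDuplicates(["artist_id"])
    )
    artists_table.write.mode("overwrite").parquet(output_data + "artists/")


def process_log_data(spark, input_data, output_data):
    """Extract log JSON files, apply the quality checks described above, and
    write the users, time, and songplays tables (omitted here for brevity)."""
    ...


def main():
    spark = create_spark_session()
    input_data = "s3a://udacity-dend/"
    output_data = "s3a://your-output-bucket/"  # placeholder; replace with a real bucket
    process_song_data(spark, input_data, output_data)
    process_log_data(spark, input_data, output_data)


if __name__ == "__main__":
    main()
```

With Spark available, a script structured like this would typically be launched with spark-submit etl.py.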