Sparkify is a music streaming startup that has grown rapidly over the past few months, and its services are now known worldwide.
The customer database has become huge, bringing new challenges in delivering diverse data to business analysts in a timely manner. New roles, such as data scientists, will also be working with that data.
The purpose of this project is to bring the current Data Warehouse into the Big Data realm.
A data lake gives us the ability to deal with both structured and unstructured data. It lets data analysts perform fast, ad-hoc data exploration, and it supports new types of analytics such as machine learning and natural language processing.
Finally, a data lake shares the same goal as a conventional Data Warehouse, supporting business insights, which makes it the data engineering answer to Sparkify's new data challenges.
In the context of a data lake, dimensional modeling remains a valuable practice.
Data resides in two S3 directories that contain files in JSON format (a read sketch follows the list):
- s3a://udacity-dend/song_data : Contains metadata about a song and the artist of that song;
- s3a://udacity-dend/log_data : Consists of log files generated by the streaming app based on the songs in the dataset above.
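
Both datasets can be loaded straight from S3 with Spark's schema-on-read JSON reader. The sketch below assumes the usual layout of these datasets (song files nested three directories deep by track-ID prefix, log files partitioned by year and month); adjust the glob patterns if your copy differs:

```python
from pyspark.sql import SparkSession

# Assumes AWS credentials and the hadoop-aws package are already configured.
spark = SparkSession.builder.appName("sparkify-explore").getOrCreate()

# Song metadata: one JSON object per file, assumed to be nested by the
# first three letters of each track ID (e.g. song_data/A/B/C/TR....json).
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")

# Activity logs, assumed to be partitioned by year and month
# (e.g. log_data/2018/11/2018-11-12-events.json).
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

# Schema-on-read: Spark infers the columns from the JSON itself.
song_df.printSchema()
log_df.printSchema()
```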
Analytics are best performed when data meets quality standards, so the following data quality actions were taken in this project:
- Blank spaces and zeros were replaced with `null`;
- Duplicates were removed from the dimension tables. A special mention goes to the users table, where the most recent interaction is picked as the best version of each user in `users_table`; this ensures we keep the latest `level` status for every user (see the sketch after this list).
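
As an illustration, the sketch below shows how both data quality steps could look in PySpark. The column names (`userId`, `level`, `ts`) follow the log dataset's usual schema; treat the exact expressions as an assumption, not the project's verbatim code:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("sparkify-dq").getOrCreate()
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

# Replace blank strings and zeros with null, then drop the null users.
users = (log_df
         .withColumn("userId",
                     F.when(F.col("userId").isin("", "0"), None)
                      .otherwise(F.col("userId")))
         .where(F.col("userId").isNotNull()))

# Keep only the most recent event per user so the latest `level` wins.
latest_first = Window.partitionBy("userId").orderBy(F.col("ts").desc())
users_table = (users
               .withColumn("rn", F.row_number().over(latest_first))
               .where(F.col("rn") == 1)
               .select("userId", "firstName", "lastName", "gender", "level"))
```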
- etl.py: Responsible for orchestrating the entire data pipeline: it extracts the JSON source files from S3, loads them into schema-on-read tables, transforms the data with DQ checks, and finally writes the results into five separate tables in S3 (a condensed sketch follows).
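
A condensed view of that flow, reduced to a single dimension table, might look like the following; the output bucket and exact transformations are placeholders, not the script's actual contents:

```python
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()
    input_data = "s3a://udacity-dend/"
    output_data = "s3a://sparkify-lake/"  # hypothetical output bucket

    # Extract: schema-on-read over the raw JSON song files.
    songs = spark.read.json(input_data + "song_data/*/*/*/*.json")

    # Transform: project the songs dimension and apply a DQ step.
    songs_table = (songs
                   .select("song_id", "title", "artist_id", "year", "duration")
                   .dropDuplicates(["song_id"]))

    # Load: write back to S3 as Parquet, partitioned for query pruning.
    (songs_table.write
                .mode("overwrite")
                .partitionBy("year", "artist_id")
                .parquet(output_data + "songs/"))

if __name__ == "__main__":
    main()
```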