Udacity Project 4 - Data Lakes

Introduction

Sparkify's data and user base are growing rapidly, and the company wants to move its data warehouse to a data lake. Currently, the data resides in S3: a directory of JSON logs of user activity on the app, and a directory of JSON metadata about the songs in the app.

The goal of this project is to build an ETL pipeline that extracts Sparkify's data from S3, processes it with Apache Spark, and writes it back to S3 as Parquet files.

Schema for Song Play Analysis

Just like in previous projects for this course, we will be creating a star schema optimized for queries on song play analysis.

ETL (Extract, Transform, Load) Pipeline

  1. Load AWS credentials from dl.cfg.
  2. Read Sparkify's data from S3:
    • song_data: s3://udacity-dend/song_data
    • log_data: s3://udacity-dend/log_data
  3. Process the data with Apache Spark:
    • transform the data into the five tables described in 'Tables' below.
  4. Write the resulting tables back to S3 as Parquet files (a PySpark sketch of these steps follows this list).
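
A minimal sketch of these steps in PySpark. It assumes dl.cfg has an [AWS] section containing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, that the JSON files sit in the nested directory layouts implied by the path globs, and that s3a://my-sparkify-lake/ stands in for whatever output bucket you create:

```python
import configparser
import os

from pyspark.sql import SparkSession

# 1. Load AWS credentials from dl.cfg and expose them to the S3 connector.
config = configparser.ConfigParser()
config.read("dl.cfg")
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

# Spark session with the hadoop-aws package so s3a:// paths resolve.
spark = (
    SparkSession.builder
    .appName("sparkify-data-lake")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
    .getOrCreate()
)

# 2. Read the Sparkify datasets from S3 (nested JSON directories).
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

# 3. Transform into the five tables (see the sketch under 'Tables' below).

# 4. Write each table back to S3 as Parquet, e.g. a songs table
#    partitioned by year and artist_id:
# songs_table.write.parquet(
#     "s3a://my-sparkify-lake/songs/",
#     mode="overwrite",
#     partitionBy=["year", "artist_id"],
# )
```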

S3 Buckets

  • Bucket 1 (song_data): s3://udacity-dend/song_data
    • contains static metadata about artists and songs
  • Bucket 2 (log_data): s3://udacity-dend/log_data
    • contains user activity data (who listened to which song, when, where, etc.); a quick inspection sketch follows this list
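
To get a feel for each dataset, one can read a small slice and print its schema. The sub-paths below are illustrative assumptions about how the buckets are laid out; adjust them to whatever prefixes actually exist:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-inspect").getOrCreate()

# Bucket 1: static song/artist metadata (one JSON object per file).
# The A/A/A prefix is just a small slice for a quick look.
song_df = spark.read.json("s3a://udacity-dend/song_data/A/A/A/*.json")
song_df.printSchema()

# Bucket 2: user activity events (one JSON object per line).
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")
log_df.printSchema()
log_df.show(5, truncate=False)
```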

Tables

  • Fact table: songplays, whose attributes reference the dimension tables
  • Dimension tables: users, songs, artists, and time (a sketch of how all five tables are derived follows this list)
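
A PySpark sketch of how the five tables might be derived. The column names (title, artist_name, userId, ts, page, sessionId, userAgent, and so on) are the fields these datasets typically contain; treat them as assumptions and adjust to the actual schemas:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sparkify-tables").getOrCreate()

# DataFrames as read in the pipeline sketch above.
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

# Dimension: songs (from song_data).
songs_table = song_df.select(
    "song_id", "title", "artist_id", "year", "duration"
).dropDuplicates(["song_id"])

# Dimension: artists (from song_data).
artists_table = song_df.select(
    "artist_id",
    F.col("artist_name").alias("name"),
    F.col("artist_location").alias("location"),
    F.col("artist_latitude").alias("latitude"),
    F.col("artist_longitude").alias("longitude"),
).dropDuplicates(["artist_id"])

# Keep only actual song-play events and derive a timestamp column.
plays = (
    log_df.filter(F.col("page") == "NextSong")
    .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
)

# Dimension: users (from log_data).
users_table = plays.select(
    F.col("userId").alias("user_id"),
    F.col("firstName").alias("first_name"),
    F.col("lastName").alias("last_name"),
    "gender",
    "level",
).dropDuplicates(["user_id"])

# Dimension: time (broken out from the event timestamp).
time_table = plays.select("start_time").dropDuplicates().select(
    "start_time",
    F.hour("start_time").alias("hour"),
    F.dayofmonth("start_time").alias("day"),
    F.weekofyear("start_time").alias("week"),
    F.month("start_time").alias("month"),
    F.year("start_time").alias("year"),
    F.dayofweek("start_time").alias("weekday"),
)

# Fact: songplays -- each play event joined to its song/artist metadata.
songplays_table = (
    plays.join(
        song_df,
        (plays.song == song_df.title) & (plays.artist == song_df.artist_name),
        "left",
    )
    .select(
        F.monotonically_increasing_id().alias("songplay_id"),
        "start_time",
        F.col("userId").alias("user_id"),
        "level",
        "song_id",
        "artist_id",
        F.col("sessionId").alias("session_id"),
        "location",
        F.col("userAgent").alias("user_agent"),
    )
)
```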