
Data Lakes with Spark and AWS S3

Udacity Data Engineering Nanodegree

Project Context

  • The purpose of this project is to extract song and user log data from AWS S3 for a music app called Sparkify, transform this data into a meaningful dimensional model with facts and dimensions in a Star Schema, and then load the results back into AWS S3 in parquet format.

Solution Description

  • In order to best fit the needs of the startup, 5 new tables were built with Spark following a Star Schema, as shown in the image below:
    entity_diagram.jpg
  • The songplay table is the Fact Table of the Star Schema, used to quickly retrieve information about user activity on the app, while the Dimension Tables users, time, songs and artists can be used to retrieve detailed information about the entities referenced in the songplay table.
  • The raw data (songs and user logs) is stored in JSON format in the /raw partition on AWS S3, and the resulting processed tables will reside in the /processed partition on AWS S3.
  • The Storage Layer of this project is AWS S3 and the Processing Layer is a Spark cluster running PySpark: the cluster loads the raw data from AWS S3, processes it to create all 5 tables, and then loads those tables back into AWS S3 in parquet format.
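  • A minimal PySpark sketch of this flow for a single table is shown below. The bucket paths, JSON layout and column names are illustrative assumptions; the full pipeline that builds all 5 tables lives in etl.py in this repo.

    # Minimal sketch: extract raw song JSON from S3, transform it into the
    # songs dimension, and load it back to S3 as partitioned parquet.
    # Paths and column names are assumptions -- see etl.py for the real pipeline.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sparkify-etl-sketch").getOrCreate()

    # Extract: read the raw JSON song data from the /raw partition
    song_df = spark.read.json("s3a://udacity-de-files/raw/song_data/*/*/*/*.json")

    # Transform: keep only the columns that make up the songs dimension
    songs_table = song_df.select(
        "song_id", "title", "artist_id", "year", "duration"
    ).dropDuplicates(["song_id"])

    # Load: write back to the /processed partition in parquet,
    # partitioned by year and artist_id for cheaper reads
    songs_table.write.mode("overwrite") \
        .partitionBy("year", "artist_id") \
        .parquet("s3a://udacity-de-files/processed/songs/")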

Steps to reproduce the solution:

  • First of all, we need to configure and create our EMR Cluster
    1. Configure a bootstrap.sh file and upload it to S3 so your EMR cluster can install some packages on start: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html. Read the bootstrap.sh file in this repo for an example (a minimal sketch also appears after these steps), and read this article to understand how to install packages on a cluster that is already running: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/.
    2. After that, read the create_emr.sh.example file to see how to create your own create_emr.sh file. Then create your EMR cluster by running the script: $ source create_emr.sh
    3. After creating the cluster, we need to allow SSH connections to the master node so we can copy the etl.py script to it and submit it:
    • First, allow SSH access to the master node: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-ssh-prereqs.html. You will need to edit an inbound rule.
    • SSH to your master node: $ ssh -i PATH_TO_MY_KEY_PAIR_FILE.pem hadoop@MASTER_PUBLIC_DNS
    • Once inside the master node, change the Pyspark Defaults to run Python3: $ sudo sed -i -e '$a\export PYSPARK_PYTHON=/usr/bin/python3' /etc/spark/conf/spark-env.sh
    • Copy the etl script to your master node: $ scp -i <PEM_KEY> etl.py hadoop@<MASTER_PUBLIC_DNS>:~/
    • Execute it: $ spark-submit etl.py
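  • As a reference for step 1 above, a minimal bootstrap.sh might look like the sketch below. The package list is an illustrative assumption; check the bootstrap.sh file in this repo for what this project actually installs.

    #!/bin/bash
    # Minimal bootstrap sketch: EMR runs this on every node at cluster start.
    # The packages below are illustrative -- see bootstrap.sh in this repo
    # for the list this project actually uses.
    set -e

    sudo pip3 install --upgrade pip
    sudo pip3 install pandas pyarrow
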
  • Finally, check the /processed partition of your bucket to confirm that all the tables were created. Mine were created at s3://udacity-de-files/processed.
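  • If you prefer the command line over the S3 console, the AWS CLI can list the processed objects (the bucket name below is the one used in this project; adjust it to your own):

    $ aws s3 ls s3://udacity-de-files/processed/ --recursive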