How to run the ETL process.

  1. Set up your AWS access keys in dl.cfg (an example layout is shown after these steps)
  2. Run the following commands from a Python shell:
import etl
etl.main()
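
dl.cfg is a plain-text config file that etl.py reads for the AWS credentials. The section and key names below are an assumed layout rather than taken from etl.py, so check what it actually expects, and never commit real keys:

# Assumed dl.cfg layout; replace the placeholders with your own credentials
[AWS]
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY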

Purpose of this dataset in the context of the startup, Sparkify, and their analytical goals.

  • This dataset contains analytical data about Sparkify users' song plays, represented as a fact table in a star schema. It also contains dimension data for users, songs, artists, and the timestamps of the songplay records.
  • Sparkify often wants to know where its users are located and which songs interest them, so that it can procure better song content from music providers.
  • Sparkify also wants to identify popular songs and artists in each area so that it can recommend songs to users in similar locations (see the example query after this list).
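
As an illustration of these analytical goals, the sketch below counts plays per song for each user location. It assumes the songplays fact table has already been written to S3 as parquet and that it carries location and song_id columns; the bucket path and column names are assumptions for illustration, not taken from etl.py.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-analysis").getOrCreate()

# Load the songplays fact table produced by the ETL (path is an assumption)
songplays = spark.read.parquet("s3a://your-output-bucket/songplays/")
songplays.createOrReplaceTempView("songplays")

# Most played songs per user location (column names are assumptions)
top_songs_by_location = spark.sql("""
    SELECT location, song_id, COUNT(*) AS plays
    FROM songplays
    GROUP BY location, song_id
    ORDER BY plays DESC
""")
top_songs_by_location.show(20)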

Database schema design and ETL pipeline.

  • The data is modelled using a star schema. Song play data is represented as a fact table, while users, songs, artists, and the timestamps of the songplay records are represented as dimension tables.
  • The ETL pipeline consists of S3 and Spark. Spark first reads the raw data from S3 and performs the transformations, then writes the fact and dimension tables back to S3 as the star schema tables (a sketch of this flow follows this list).
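
A minimal sketch of that flow for one dimension table, under assumed S3 paths, bucket names, and column names (the real ones live in etl.py and dl.cfg):

import configparser
import os

from pyspark.sql import SparkSession

# Read AWS credentials from dl.cfg (section and key names are assumptions)
config = configparser.ConfigParser()
config.read("dl.cfg")
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()

# Step 1: Spark reads the raw song data from S3 (input path is an assumption)
song_data = spark.read.json("s3a://your-input-bucket/song_data/*/*/*/*.json")

# Step 2: transform, keeping the columns that make up the songs dimension table
songs_table = song_data.select(
    "song_id", "title", "artist_id", "year", "duration"
).dropDuplicates(["song_id"])

# Step 3: write the dimension table back to S3 as partitioned parquet files
songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet("s3a://your-output-bucket/songs/")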

Directory

  • data/ : Contains sample data files
  • dl.cfg : Configuration file for AWS access keys
  • etl.py : ETL main file. Run its main() method to start the ETL process.
  • README.md : The file you are reading.