This project creates a data lake with Amazon S3 and Amazon EMR, using Spark through PySpark.
Step by step:
- Fill the `dl.cfg` file with the configuration data for the data lake in S3 of my account. For security reasons, the file is not included (or is left empty); a sample layout is sketched after this list;
- Run the `etl.py` script, which creates and organizes the following:
  - Select the sources and create the tables in PySpark;
  - Read the data from the sources and insert it into the tables;
  - Create an output file to present the cleaned data.
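
A minimal sketch of what `dl.cfg` might contain. The section and key names here are assumptions about what `etl.py` reads; adjust them to match the script:

```ini
; Hypothetical layout for dl.cfg -- key names are assumptions
[AWS]
AWS_ACCESS_KEY_ID=<your access key>
AWS_SECRET_ACCESS_KEY=<your secret key>
```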
The only command needed is:

```bash
python etl.py
```

Remember to create and update `dl.cfg` first.
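
The script itself is not reproduced here, but a minimal, hypothetical sketch of the extract-transform-load flow it implements could look like the following. The bucket paths, config section, and key names are illustrative assumptions, not the script's exact values:

```python
# Hypothetical sketch of the etl.py flow; bucket paths and config key
# names are illustrative assumptions, not the script's exact values.
import configparser
import os

from pyspark.sql import SparkSession

# Load AWS credentials from dl.cfg so Spark can reach the S3 buckets.
config = configparser.ConfigParser()
config.read("dl.cfg")
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Extract: read one of the raw JSON datasets from the S3 data lake.
song_data = spark.read.json("s3a://input-bucket/song_data/*/*/*/*.json")

# Transform: select the columns of a dimension table and deduplicate.
songs_table = (
    song_data
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)

# Load: write the table back to S3 as partitioned parquet files.
(
    songs_table.write
    .mode("overwrite")
    .partitionBy("year", "artist_id")
    .parquet("s3a://output-bucket/songs/")
)
```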
The main idea is to extract data from a data lake in S3, transform it with Spark running on AWS EMR, and load the results back into S3. Here are the tables and their columns:
Fact Table:
- songplays: records in the log data associated with song plays, i.e. records with page `NextSong` (see the sketch after the column list);
  songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
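
As a sketch of how this fact table can be assembled in PySpark: filter the log events to `NextSong` pages, then join them to the song data to resolve `song_id` and `artist_id`. The raw field names (`song`, `artist`, `ts`, `userId`, and so on) are assumptions about the source JSON, not confirmed by this README:

```python
# Hypothetical sketch of assembling the songplays fact table; raw field
# names and dataset paths are assumptions about the source JSON.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("songplays-sketch").getOrCreate()
logs = spark.read.json("s3a://input-bucket/log_data/*/*/*.json")
songs = spark.read.json("s3a://input-bucket/song_data/*/*/*/*.json")

# Keep only real song plays: events whose page is 'NextSong'.
plays = logs.filter(F.col("page") == "NextSong")

# Join the events to the song catalogue to resolve song_id and artist_id.
songplays = (
    plays
    .join(
        songs,
        (plays.song == songs.title) & (plays.artist == songs.artist_name),
        "left",
    )
    .select(
        F.monotonically_increasing_id().alias("songplay_id"),
        (F.col("ts") / 1000).cast("timestamp").alias("start_time"),
        F.col("userId").alias("user_id"),
        "level",
        "song_id",
        "artist_id",
        F.col("sessionId").alias("session_id"),
        "location",
        F.col("userAgent").alias("user_agent"),
    )
)
```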
Dimension Tables:
- users: users in the app;
  user_id, first_name, last_name, gender, level
- songs: songs in the music database;
  song_id, title, artist_id, year, duration
- artists: artists in the music database;
  artist_id, name, location, latitude, longitude
- time: timestamps of records in songplays broken down into specific units (see the sketch after this list);
  start_time, hour, day, week, month, year, weekday
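
As a companion sketch, the time dimension can be derived from the same filtered `plays` DataFrame used in the songplays example above; this is again an assumption about the script's internals, not its exact code:

```python
# Hypothetical derivation of the time dimension; `plays` is the
# NextSong-filtered log DataFrame from the songplays sketch above.
from pyspark.sql import functions as F

time_table = (
    plays
    .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
    .select("start_time")
    .dropDuplicates()
    .withColumn("hour", F.hour("start_time"))
    .withColumn("day", F.dayofmonth("start_time"))
    .withColumn("week", F.weekofyear("start_time"))
    .withColumn("month", F.month("start_time"))
    .withColumn("year", F.year("start_time"))
    .withColumn("weekday", F.dayofweek("start_time"))
)

# Partitioning by year and month keeps the parquet output easy to prune.
(
    time_table.write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet("s3a://output-bucket/time/")
)
```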