This project creates a data lake with Amazon S3 and Amazon EMR, using Spark through PySpark.
Step by step:
- Fill the `dl.cfg` file with the configuration data for the data lake in S3 of my account. For security reasons, the file is not included (or is left empty); a sample layout is sketched after this list;
- Run the `etl.py` script, which creates and organizes the following:
  - Select the sources and create the tables in PySpark;
  - Read the data from the sources and insert it into the tables;
  - Create an output file to present the cleaned data.
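
A minimal sketch of what `dl.cfg` might contain. The section and key names here are assumptions about what `etl.py` reads; adjust them to match the script:

```ini
; Hypothetical layout for dl.cfg -- key names are assumptions
[AWS]
AWS_ACCESS_KEY_ID=<your access key>
AWS_SECRET_ACCESS_KEY=<your secret key>
```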
The only command needed is:

```bash
python etl.py
```

Remember to create and update `dl.cfg` first.
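
The script itself is not reproduced here, but a minimal, hypothetical sketch of the extract-transform-load flow it implements could look like the following. The bucket paths, config section, and key names are illustrative assumptions, not the script's exact values:

```python
# Hypothetical sketch of the etl.py flow; bucket paths and config key
# names are illustrative assumptions, not the script's exact values.
import configparser
import os

from pyspark.sql import SparkSession

# Load AWS credentials from dl.cfg so Spark can reach the S3 buckets.
config = configparser.ConfigParser()
config.read("dl.cfg")
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Extract: read one of the raw JSON datasets from the S3 data lake.
song_data = spark.read.json("s3a://input-bucket/song_data/*/*/*/*.json")

# Transform: select the columns of a dimension table and deduplicate.
songs_table = (
    song_data
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)

# Load: write the table back to S3 as partitioned parquet files.
(
    songs_table.write
    .mode("overwrite")
    .partitionBy("year", "artist_id")
    .parquet("s3a://output-bucket/songs/")
)
```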
The main idea is to extract data from a data lake in S3, transform it with Spark running on AWS EMR, and load the results back into S3. Here are the tables and their columns:
Fact Table:
- songplays: records in the log data associated with song plays, i.e. records with page `NextSong` (see the sketch after the column list);
  songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
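
As a sketch of how this fact table can be assembled in PySpark: filter the log events to `NextSong` pages, then join them to the song data to resolve `song_id` and `artist_id`. The raw field names (`song`, `artist`, `ts`, `userId`, and so on) are assumptions about the source JSON, not confirmed by this README:

```python
# Hypothetical sketch of assembling the songplays fact table; raw field
# names and dataset paths are assumptions about the source JSON.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("songplays-sketch").getOrCreate()
logs = spark.read.json("s3a://input-bucket/log_data/*/*/*.json")
songs = spark.read.json("s3a://input-bucket/song_data/*/*/*/*.json")

# Keep only real song plays: events whose page is 'NextSong'.
plays = logs.filter(F.col("page") == "NextSong")

# Join the events to the song catalogue to resolve song_id and artist_id.
songplays = (
    plays
    .join(
        songs,
        (plays.song == songs.title) & (plays.artist == songs.artist_name),
        "left",
    )
    .select(
        F.monotonically_increasing_id().alias("songplay_id"),
        (F.col("ts") / 1000).cast("timestamp").alias("start_time"),
        F.col("userId").alias("user_id"),
        "level",
        "song_id",
        "artist_id",
        F.col("sessionId").alias("session_id"),
        "location",
        F.col("userAgent").alias("user_agent"),
    )
)
```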
Dimension Tables:
- users: users in the app;
  user_id, first_name, last_name, gender, level
- songs: songs in the music database;
  song_id, title, artist_id, year, duration
- artists: artists in the music database;
  artist_id, name, location, latitude, longitude
- time: timestamps of records in songplays broken down into specific units (see the sketch after this list);
  start_time, hour, day, week, month, year, weekday
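
As a companion sketch, the time dimension can be derived from the same filtered `plays` DataFrame used in the songplays example above; this is again an assumption about the script's internals, not its exact code:

```python
# Hypothetical derivation of the time dimension; `plays` is the
# NextSong-filtered log DataFrame from the songplays sketch above.
from pyspark.sql import functions as F

time_table = (
    plays
    .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
    .select("start_time")
    .dropDuplicates()
    .withColumn("hour", F.hour("start_time"))
    .withColumn("day", F.dayofmonth("start_time"))
    .withColumn("week", F.weekofyear("start_time"))
    .withColumn("month", F.month("start_time"))
    .withColumn("year", F.year("start_time"))
    .withColumn("weekday", F.dayofweek("start_time"))
)

# Partitioning by year and month keeps the parquet output easy to prune.
(
    time_table.write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet("s3a://output-bucket/time/")
)
```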