/AWS_sparkify_etl_training

This repo stores all material related to the Udacity Data Engineering project for the AWS intro module.


Udacity Data Engineering Nanodegree

Project Cloud Data Warehouse - Overview

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory with JSON metadata on the songs in its app.

This project builds the data structures needed for the Sparkify analytics team to query the data warehouse and derive insights, KPIs, and dashboards.

The ETL process created in this project reads the Sparkify JSON data from AWS S3, creates the data warehouse structure, and loads the transformed data into it.
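Concretely, the load from S3 into Redshift is done with COPY statements issued against the cluster. Below is a minimal sketch of that idea, assuming dwh.cfg has a [CLUSTER] section with HOST, DB_NAME, DB_USER, DB_PASSWORD and DB_PORT keys (a sample layout is shown in the Executing section); the staging table name, bucket path, IAM role ARN and region are placeholders, not values from this repo:

    import configparser
    import psycopg2

    # read the cluster connection settings from dwh.cfg (key names assumed)
    config = configparser.ConfigParser()
    config.read('dwh.cfg')
    cluster = config['CLUSTER']

    conn = psycopg2.connect(
        host=cluster['HOST'], dbname=cluster['DB_NAME'], user=cluster['DB_USER'],
        password=cluster['DB_PASSWORD'], port=cluster['DB_PORT']
    )
    cur = conn.cursor()

    # stage the raw song JSON files from S3 into a (placeholder) staging table;
    # the log files follow the same pattern, often with a JSONPaths file instead of 'auto'
    cur.execute("""
        COPY staging_songs
        FROM 's3://<your-bucket>/song_data'
        IAM_ROLE '<your-iam-role-arn>'
        REGION 'us-west-2'
        FORMAT AS JSON 'auto';
    """)
    conn.commit()
    conn.close()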

Datasets

Song Dataset:

It's a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
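For example, individual song files sit under paths like the following (illustrative file names, following the partitioning scheme above):

    song_data/A/B/C/TRABCEI128F424C983.json
    song_data/A/A/B/TRAABJL12903CDCF1A.json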

Sample Data:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log Dataset:

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from an imaginary music streaming app, driven by configuration settings.

The log files in the dataset are partitioned by year and month.
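For example (illustrative paths, assuming the standard layout of this dataset):

    log_data/2018/11/2018-11-12-events.json
    log_data/2018/11/2018-11-13-events.json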

Sample Data:

    {"artist":None,"auth":"Logged In","firstName":"Celeste","gender":"F","itemInSession":0,"lastName":"Williams","length":NaN,"level":"free","location":"Klamath Falls, OR","method":"GET","page":"Home","registration":1.541078e+12.0,"sessionId":438,"song":None,"status":200,"ts":1541990217796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebK..."","userId":"53"}

Executing

  • You'll need a running Redshift cluster and must fill in the required information in the dwh.cfg file (a sample layout is sketched below).
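A sample layout for dwh.cfg (the section and key names below are assumptions and must match what the scripts actually read; never commit real credentials):

    [CLUSTER]
    HOST=<your-redshift-cluster-endpoint>
    DB_NAME=<your-database-name>
    DB_USER=<your-database-user>
    DB_PASSWORD=<your-database-password>
    DB_PORT=5439

    [IAM_ROLE]
    ARN=<your-iam-role-arn>

    [S3]
    LOG_DATA=<s3-path-to-the-log-data>
    LOG_JSONPATH=<s3-path-to-the-log-jsonpaths-file>
    SONG_DATA=<s3-path-to-the-song-data>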

1st: Run the create_tables.py script to create the tables (both scripts are sketched after these steps).

2nd: Run the etl.py script to load the data into the warehouse.
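Minimal sketches of what these two steps typically look like in this kind of project, assuming the SQL statements live in a separate sql_queries.py module (the module and query-list names are assumptions, not necessarily what this repo uses):

    # create_tables.py (sketch) -- resets the warehouse schema on the Redshift cluster
    import configparser
    import psycopg2
    from sql_queries import create_table_queries, drop_table_queries  # assumed module

    def main():
        config = configparser.ConfigParser()
        config.read('dwh.cfg')
        cluster = config['CLUSTER']

        conn = psycopg2.connect(
            host=cluster['HOST'], dbname=cluster['DB_NAME'], user=cluster['DB_USER'],
            password=cluster['DB_PASSWORD'], port=cluster['DB_PORT']
        )
        cur = conn.cursor()

        # drop any existing tables, then create them fresh
        for query in drop_table_queries + create_table_queries:
            cur.execute(query)
            conn.commit()

        conn.close()

    if __name__ == "__main__":
        main()

And the ETL step, which first stages the raw S3 JSON data and then populates the analytics tables:

    # etl.py (sketch) -- stage the raw JSON from S3, then load the analytics tables
    import configparser
    import psycopg2
    from sql_queries import copy_table_queries, insert_table_queries  # assumed module

    def main():
        config = configparser.ConfigParser()
        config.read('dwh.cfg')
        cluster = config['CLUSTER']

        conn = psycopg2.connect(
            host=cluster['HOST'], dbname=cluster['DB_NAME'], user=cluster['DB_USER'],
            password=cluster['DB_PASSWORD'], port=cluster['DB_PORT']
        )
        cur = conn.cursor()

        # 1) COPY the song and log JSON files from S3 into staging tables
        for query in copy_table_queries:
            cur.execute(query)
            conn.commit()

        # 2) transform and insert from the staging tables into the analytics tables
        for query in insert_table_queries:
            cur.execute(query)
            conn.commit()

        conn.close()

    if __name__ == "__main__":
        main()

Running create_tables.py followed by etl.py should then leave the warehouse ready for the analytics team to query.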