udacity_data_engineering_datalake: A Python repository from Narengowda

Data lake using Spark

Sparkify is a music streaming startup that has grown rapidly over the past few months, and its service is now known worldwide.

The customer database has become huge, which brings new challenges in delivering diverse data to business analysts in a timely manner. New roles, such as data scientists, will also work with that data.

Usage instructions: This Python notebook was run in an AWS EMR notebook. To run it from outside AWS, you may need to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
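When running outside AWS, one possible way to hand the credentials to Spark is through the S3A connector settings. The sketch below is an assumption, not code taken from the notebook: the helper names `s3a_conf_from_env` and `build_session` and the application name are hypothetical, and outside EMR the `hadoop-aws` package typically also has to be available to Spark.

```python
import os


def s3a_conf_from_env(env=None):
    """Map the AWS credential environment variables to Spark S3A settings.

    Hypothetical helper; not part of the original notebook.
    """
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "spark.hadoop.fs.s3a.access.key": env["AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": env["AWS_SECRET_ACCESS_KEY"],
    }


def build_session(app_name="sparkify-datalake"):
    """Create a SparkSession that carries the credentials (needs pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily; only used here

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_conf_from_env().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

On EMR this step is usually unnecessary, since the cluster's instance role can provide S3 access.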

Data Sources

Data resides in two directories that contain files in JSON format:

s3a://udacity-dend/song_data : Contains metadata about a song and the artist of that song;
s3a://udacity-dend/log_data : Consists of log files generated by the streaming app based on the songs in the dataset above;

Songs and Artists data processing

The data is loaded from the song_data S3 folder and saved back to S3 as Parquet files, which are later used to populate the fact table.
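The step described above could look roughly like the following. This is a minimal sketch under stated assumptions: the function name `process_song_data` is hypothetical, and the column lists are guesses at the song_data JSON schema, which this README does not spell out.

```python
# Column lists are assumptions about the song_data JSON schema.
SONG_COLUMNS = ["song_id", "title", "artist_id", "year", "duration"]
ARTIST_COLUMNS = ["artist_id", "artist_name", "artist_location",
                  "artist_latitude", "artist_longitude"]


def process_song_data(spark, input_path, output_path):
    """Read the song JSON files and write songs and artists as Parquet."""
    df = spark.read.json(input_path)

    # One row per song.
    songs = df.select(*SONG_COLUMNS).dropDuplicates(["song_id"])
    songs.write.mode("overwrite").parquet(output_path + "/songs")

    # One row per artist.
    artists = df.select(*ARTIST_COLUMNS).dropDuplicates(["artist_id"])
    artists.write.mode("overwrite").parquet(output_path + "/artists")
```

Here `input_path` would be a path or glob under s3a://udacity-dend/song_data that matches the JSON files, and `output_path` an S3 location you can write to.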

Logs data processing

The data is loaded from the log_data folder in S3 and combined with the songs and artists Parquet files to create the fact table.
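A sketch of how the logs might be joined with the saved Parquet files is shown below. Everything schema-related here is an assumption: the function name `process_log_data`, the log field names (`page`, `song`, `artist`, `ts`, `userId`, and so on), the `NextSong` filter, and the `songplays` output name are illustrative and are not confirmed by this README.

```python
def process_log_data(spark, log_path, output_path):
    """Join log events with songs and artists to build a fact table."""
    # Assumption: only 'NextSong' events represent actual song plays.
    logs = spark.read.json(log_path).filter("page = 'NextSong'")

    # Reuse the Parquet files written by the songs and artists step.
    songs = spark.read.parquet(output_path + "/songs")
    artists = spark.read.parquet(output_path + "/artists")
    catalog = songs.join(artists, "artist_id")

    # Match each log event to a song by title and artist name.
    # A list of conditions is combined with AND by DataFrame.join.
    conditions = [
        logs["song"] == catalog["title"],
        logs["artist"] == catalog["artist_name"],
    ]
    songplays = logs.join(catalog, conditions, "left").select(
        "ts", "userId", "level", "song_id", "artist_id",
        "sessionId", "location", "userAgent",
    )
    songplays.write.mode("overwrite").parquet(output_path + "/songplays")
```

The left join keeps log events that have no matching song, so the fact table still records every play; an inner join would drop them instead.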

Usage:

Copy etl.py to your master node, connect to it, and submit the job:

    scp -i ~/yourkeypair.pem etl.py hadoop@ec2-3-21-129-236.us-east-2.compute.amazonaws.com:/home/hadoop/
    ssh -i ~/yourkeypair.pem hadoop@ec2-3-21-129-236.us-east-2.compute.amazonaws.com
    spark-submit etl.py

Use http://parquet-viewer-online.com/ to view the Parquet data stored in S3.
