Sparkify is a music streaming startup that has grown quickly over the past few months, and its service is now known worldwide.
The customer database has become huge, which makes it harder to deliver diverse data to business analysts in a timely manner. In addition, new roles, such as data scientists, are going to work on that data.
Usage instructions: This Python notebook was run on an AWS EMR notebook. To run it from outside AWS, you might need to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
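A minimal sketch of one way to pass those credentials to Spark when running outside AWS. It assumes the two keys are exported as environment variables and that the hadoop-aws libraries are already on the Spark classpath. The function names and the application name "sparkify-etl" are hypothetical and do not come from the original notebook.

```python
import os


def s3a_credential_options(env=None):
    """Map AWS credential environment variables to Hadoop s3a settings.

    Returns an empty dict when either variable is missing, so that on EMR
    the cluster's own instance role is used instead.
    """
    env = os.environ if env is None else env
    access_key = env.get("AWS_ACCESS_KEY_ID")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY")
    if not (access_key and secret_key):
        return {}
    return {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def create_spark_session(app_name="sparkify-etl"):
    """Build a SparkSession, adding s3a credentials only if they are set."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_credential_options().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

On EMR the environment variables are normally unset, so the session is created without any explicit keys.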
Data resides in two directories that contain files in JSON format:
s3a://udacity-dend/song_data : Contains metadata about a song and the artist of that song;
s3a://udacity-dend/log_data : Consists of log files generated by the streaming app based on the songs in the dataset above;
Song data is loaded from the song_data folder in S3 and saved back to S3 as Parquet files, which are later used to populate the fact table.
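The song step could look roughly like the sketch below. The column names, the three-level folder glob, the partitioning by year and artist, and the output table names "songs" and "artists" are all assumptions about the dataset and are not taken from the original notebook. The function expects an existing SparkSession as its first argument.

```python
# Assumed columns of the song metadata files (hypothetical schema).
SONG_COLUMNS = ["song_id", "title", "artist_id", "year", "duration"]
ARTIST_COLUMNS = [
    "artist_id",
    "artist_name",
    "artist_location",
    "artist_latitude",
    "artist_longitude",
]


def table_path(root, name):
    """Join an S3 prefix and a relative name with exactly one slash."""
    return root.rstrip("/") + "/" + name.lstrip("/")


def process_song_data(spark, input_root, output_root):
    """Read song JSON files and write songs and artists as Parquet."""
    # The glob depth is an assumption about how song_data is nested.
    song_df = spark.read.json(table_path(input_root, "song_data/*/*/*/*.json"))

    # One row per song, partitioned on disk by year and artist.
    songs = song_df.select(SONG_COLUMNS).dropDuplicates(["song_id"])
    songs.write.mode("overwrite").partitionBy("year", "artist_id").parquet(
        table_path(output_root, "songs")
    )

    # One row per artist.
    artists = song_df.select(ARTIST_COLUMNS).dropDuplicates(["artist_id"])
    artists.write.mode("overwrite").parquet(table_path(output_root, "artists"))
```

A call might look like process_song_data(spark, "s3a://udacity-dend/", output_root), where output_root is an S3 prefix you have write access to.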
Log data is loaded from the log_data folder in S3 and combined with the songs and artists Parquet files to create the fact table.
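A rough sketch of how the fact table might be assembled from the logs and the two Parquet tables. Everything schema-related here is an assumption and not confirmed by the original notebook: the log columns (page, song, artist, length, ts, userId, level, sessionId, location, userAgent), the "NextSong" filter, the match on title, artist name, and duration, and the output name "songplays".

```python
def join_path(root, suffix):
    """Join an S3 prefix and a relative suffix with exactly one slash."""
    return root.rstrip("/") + "/" + suffix.lstrip("/")


def build_fact_table(spark, input_root, output_root):
    """Join song-play log events with the songs and artists tables."""
    # The glob depth is an assumption about how log_data is nested.
    log_df = spark.read.json(join_path(input_root, "log_data/*/*/*.json"))

    # Keep only events that represent a song being played (assumed marker).
    plays = log_df.filter(log_df["page"] == "NextSong")

    # Tables assumed to have been written by the song step.
    songs = spark.read.parquet(join_path(output_root, "songs"))
    artists = spark.read.parquet(join_path(output_root, "artists"))
    catalog = songs.join(artists, "artist_id").select(
        "song_id", "artist_id", "title", "artist_name", "duration"
    )

    # Left join so plays without a catalog match are kept with null IDs.
    matched = plays.join(
        catalog,
        (plays["song"] == catalog["title"])
        & (plays["artist"] == catalog["artist_name"])
        & (plays["length"] == catalog["duration"]),
        "left",
    )

    fact = matched.select(
        # ts is assumed to be epoch milliseconds.
        (matched["ts"] / 1000).cast("timestamp").alias("start_time"),
        matched["userId"].alias("user_id"),
        matched["level"],
        matched["song_id"],
        matched["artist_id"],
        matched["sessionId"].alias("session_id"),
        matched["location"],
        matched["userAgent"].alias("user_agent"),
    )
    fact.write.mode("overwrite").parquet(join_path(output_root, "songplays"))
    return fact
```

The left join is a design choice: an inner join would drop every play whose song is missing from the catalog, which may or may not be what the analysts want.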
To run the ETL on the cluster, copy etl.py to the EMR master node with scp, connect over SSH, and submit the job:
    scp -i ~/yourkeypair.pem etl.py hadoop@ec2-3-21-129-236.us-east-2.compute.amazonaws.com:/home/hadoop/
    ssh -i ~/yourkeypair.pem hadoop@ec2-3-21-129-236.us-east-2.compute.amazonaws.com
    spark-submit etl.py
Use http://parquet-viewer-online.com/ to view the Parquet data stored in S3.