Sparkify is a startup that provides a music streaming service. It has JSON metadata on the songs in its app (s3a://udacity-dend/song_data/*) and JSON logs of user activity (s3a://udacity-dend/log_data/*), both of which are currently stored in S3. As its user base grows, Sparkify has decided to move its data warehouse to a data lake.
Build an ETL pipeline that
- Extracts data from S3
- Processes the data with Spark according to the needs of the data analytics team
- Loads the transformed dimensional tables back to S3
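A minimal sketch of that flow for one table (the output bucket and the hadoop-aws version are placeholders, and it assumes your AWS credentials have been handed to the S3A connector as shown in the config.ini sketch further down):

```python
from pyspark.sql import SparkSession

# Spark session with the hadoop-aws (S3A) package so Spark can read from / write to S3.
# Pick the hadoop-aws version that matches your Hadoop installation.
spark = (
    SparkSession.builder
    .appName("sparkify-data-lake")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")
    .getOrCreate()
)

# Extract: read the raw JSON song metadata from S3.
song_df = spark.read.json("s3a://udacity-dend/song_data/*")

# Transform: build the songs dimension table from the columns the analytics team needs.
songs_table = (
    song_df
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)

# Load: write the table back to S3 as parquet; partitioning by year and artist_id
# is one reasonable layout for efficient reads.
songs_table.write.mode("overwrite").partitionBy("year", "artist_id").parquet(
    "s3a://YOUR_OUTPUT_BUCKET/songs/"
)
```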
Fact Table
songplays
- songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
Dimension Tables
users
- user_id, first_name, last_name, gender, level
songs
- song_id, title, artist_id, year, duration
artists
- artist_id, name, location, latitude, longitude
time
- start_time, hour, day, week, month, year, weekday
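As an illustration, the songplays fact table can be assembled by joining the activity logs with the song metadata. The sketch below continues the one above (it reuses `spark` and `song_df`); field names such as song, artist, length, ts, userId and artist_name are assumptions based on the usual Sparkify schemas:

```python
from pyspark.sql import functions as F

# Read the raw activity logs and keep only actual song plays.
log_df = spark.read.json("s3a://udacity-dend/log_data/*").where(F.col("page") == "NextSong")

# Resolve song_id and artist_id by matching log entries against the song metadata.
songplays_table = (
    log_df.join(
        song_df,
        (log_df.song == song_df.title)
        & (log_df.artist == song_df.artist_name)
        & (log_df.length == song_df.duration),
        "left",
    )
    .select(
        F.monotonically_increasing_id().alias("songplay_id"),
        # ts is assumed to be a Unix timestamp in milliseconds.
        (F.col("ts") / 1000).cast("timestamp").alias("start_time"),
        F.col("userId").alias("user_id"),
        "level",
        "song_id",
        "artist_id",
        F.col("sessionId").alias("session_id"),
        "location",
        F.col("userAgent").alias("user_agent"),
    )
)
```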
- config.ini - stores the AWS credentials (a layout sketch follows this list)
- etl.py - extracts the data from S3, processes it with Spark, creates the dimensional tables, and writes them back to S3 as parquet
- queries.py - stores the queries used by the ETL to process the data and create the tables
- test.py - tests whether the created tables include the right columns and were loaded to S3 correctly
- emr_etl_script.py - standalone script for running on an EMR cluster; it extracts the data from S3, processes it, creates the tables, writes them back to S3, and checks that the loaded tables have the correct columns
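config.ini is a plain INI file read with configparser. A hedged sketch of its layout and of how etl.py can hand the credentials to Spark (the section and key names are assumptions, and `spark` is the session from the sketch above):

```python
import configparser

# Assumed layout of config.ini (section and key names may differ in your copy):
#
#   [AWS]
#   AWS_ACCESS_KEY_ID = YOUR_ACCESS_KEY
#   AWS_SECRET_ACCESS_KEY = YOUR_SECRET_KEY

config = configparser.ConfigParser()
config.read("config.ini")

# Pass the credentials to the S3A connector on the existing SparkSession.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", config["AWS"]["AWS_ACCESS_KEY_ID"])
hadoop_conf.set("fs.s3a.secret.key", config["AWS"]["AWS_SECRET_ACCESS_KEY"])
```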
Before working with the full data set, test the code locally on a subset of the data:
- Update config.ini with your own AWS key and secret
- (Optional) Update the output_data path in etl.py according to your preference
- Once config.ini contains a valid AWS key and secret, start the ETL by running the following command:
$ python etl.py
- Running etl.py will take a while. Afterwards, test the outcome by running the following command:
$ python test.py
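The check in test.py can be as simple as reading each parquet table back from S3 and comparing its columns with the expected schema; a minimal sketch, assuming a SparkSession like the one above and a placeholder output path:

```python
# Expected columns per table, taken from the schema described earlier.
expected_columns = {
    "songplays": ["songplay_id", "start_time", "user_id", "level", "song_id",
                  "artist_id", "session_id", "location", "user_agent"],
    "users": ["user_id", "first_name", "last_name", "gender", "level"],
    "songs": ["song_id", "title", "artist_id", "year", "duration"],
    "artists": ["artist_id", "name", "location", "latitude", "longitude"],
    "time": ["start_time", "hour", "day", "week", "month", "year", "weekday"],
}

output_data = "s3a://YOUR_OUTPUT_BUCKET/"  # same path configured in etl.py

for table, columns in expected_columns.items():
    df = spark.read.parquet(output_data + table)
    missing = set(columns) - set(df.columns)
    assert not missing, f"{table} is missing columns: {missing}"
    rows = df.count()
    assert rows > 0, f"{table} is empty"
    print(f"{table}: OK ({rows} rows)")
```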
After testing the code locally, move to an EMR cluster to run the ETL on the full data sets:
- Create an S3 bucket to store the output data and update the output_data path in emr_etl_script.py
- Create an EMR cluster on AWS and configure the security group of the master EC2 instance to accept incoming SSH connections from your local computer. For example, using the AWS CLI:

aws emr create-cluster \
    --name sparkify \
    --use-default-roles \
    --release-label emr-5.20.0 \
    --instance-count 4 \
    --applications Name=Spark Name=Hive \
    --ec2-attributes KeyName=YOUR_EC2_KEY_NAME,SubnetId=YOUR_SUBNET_ID \
    --instance-type m5.xlarge
- Copy emr_etl_script.py to your EMR master node:
scp -i PATH_TO_EC2_KEY PATH_TO_ETL_SCRIPT hadoop@YOUR_MASTER_NODE_DNS:/home/hadoop/
- Connect to your master node:
ssh -i YOUR_EC2_KEY_PATH hadoop@YOUR_MASTER_NODE_DNS
- Submit your script:
/usr/bin/spark-submit --master yarn PATH_TO_THE_SCRIPT
- Terminate your EMR cluster. You should find all the tables stored under your chosen output_data path.
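Once the job has finished, any Spark client configured for S3 access as in the sketches above can query the parquet tables directly; a small illustrative example (the bucket name is a placeholder):

```python
# Load one of the finished tables straight from the data lake.
songplays = spark.read.parquet("s3a://YOUR_OUTPUT_BUCKET/songplays/")

# Example analytics question: the top 10 locations by song plays from paid users.
(
    songplays.where(songplays.level == "paid")
    .groupBy("location")
    .count()
    .orderBy("count", ascending=False)
    .show(10, truncate=False)
)
```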
The code was written by me; the Sparkify data belongs to Udacity.