In this project, I create a data warehouse using AWS Redshift and S3 and implement a pipeline to ingest S3 files into the database.
Two datasets are used:
- A song dataset, which is a subset of real data from the Million Song Dataset.
- A log dataset, which consists of JSON log files generated by an event simulator based on the songs in the dataset above.
Both datasets are stored in the S3 bucket s3://udacity-dend and are ingested into fact and dimension tables.
The project consists of the following files:
- redshift.py: manages the creation and teardown of a Redshift cluster.
- create_tables.py: creates the Redshift staging, fact, and dimension tables.
- etl.py: runs the ETL pipeline, extracting fact and dimension data from the S3 files.
The final product of the pipeline consists of the following tables (an illustrative DDL sketch for the fact table follows the list):
Fact Table
songplays - records in the event data associated with song plays, i.e. records with page NextSong
songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
Dimension Tables
users - users in the app
user_id, first_name, last_name, gender, level
songs - songs in the music database
song_id, title, artist_id, year, duration
artists - artists in the music database
artist_id, name, location, latitude, longitude
time - timestamps of records in songplays broken down into specific units
start_time, hour, day, week, month, year, weekday
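For illustration, a possible Redshift DDL for the songplays fact table is sketched below; the column types and the DISTKEY/SORTKEY choices are assumptions, not the exact statements used by create_tables.py:

```python
# Hypothetical DDL for the songplays fact table; the column types and the
# DISTKEY/SORTKEY choices are illustrative, not the project's exact schema.
SONGPLAYS_TABLE_CREATE = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL SORTKEY,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR DISTKEY,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""
```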
Install the prerequisites for psycopg2 as described at https://www.psycopg.org/docs/install.html
If you run into an issue with -lssl, try running the following command:
env LDFLAGS="-I/usr/local/opt/openssl/include -L/usr/local/opt/openssl/lib" pip install psycopg2
as suggested on Stack Overflow.
Create the virtual environment:
python3 -m venv venv
Activate the virtual environment:
source venv/bin/activate
Install python packages:
pip3 install -r requirements.txt
In all sections below, it is assumed that the venv environment is activated.
This section assumes an admin role has been created in AWS and that the AWS key and secret are stored in the aws.cfg file.
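For reference, the scripts can read those credentials with configparser; the section and option names in this sketch are assumptions, so check aws.cfg in this repository for the actual layout:

```python
import configparser

# Load the admin credentials from aws.cfg; the section and option
# names below are assumptions, not necessarily the file's real keys.
config = configparser.ConfigParser()
config.read("aws.cfg")

aws_key = config["AWS"]["KEY"]
aws_secret = config["AWS"]["SECRET"]
```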
Before starting the pipeline, create a Redshift cluster using the following command:
python redshift.py --cmd create
This will create a Redshift cluster and attach an S3 read-only IAM role to it. The script relies on credential information from aws.cfg as well as cluster configuration information from dwh.cfg (see the files in this repository).
The command also writes a cluster.cfg file containing both the Redshift hostname (as dwh_endpoint) and the ARN of the created role (dwh_role_arn), which the pipeline scripts use to access the S3 files.
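As a rough illustration, connecting to the cluster from the pipeline scripts might look like the sketch below; the option names read from cluster.cfg, and the database name, user, password, and port, are assumptions (in practice those values come from dwh.cfg):

```python
import configparser

import psycopg2

# Read the endpoint written by redshift.py; the key names are assumed here.
cluster = configparser.ConfigParser()
cluster.read("cluster.cfg")
host = cluster["CLUSTER"]["DWH_ENDPOINT"]

# Database name, user, password, and port would normally come from dwh.cfg.
conn = psycopg2.connect(
    host=host,
    dbname="dwh",
    user="dwhuser",
    password="Passw0rd",
    port=5439,
)
cur = conn.cursor()
```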
To delete the cluster, run the following command:
python redshift.py --cmd delete
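Internally, redshift.py presumably drives these operations with boto3. The sketch below shows what create and delete could look like; the region, node type, cluster identifier, and credentials are assumptions rather than the script's actual values:

```python
import configparser

import boto3

# Credentials are read from aws.cfg; section/option names are assumed.
config = configparser.ConfigParser()
config.read("aws.cfg")

redshift = boto3.client(
    "redshift",
    region_name="us-west-2",
    aws_access_key_id=config["AWS"]["KEY"],
    aws_secret_access_key=config["AWS"]["SECRET"],
)


def create_cluster(role_arn):
    """Create a small cluster with the S3 read-only IAM role attached."""
    redshift.create_cluster(
        ClusterType="multi-node",
        NodeType="dc2.large",
        NumberOfNodes=4,
        DBName="dwh",
        ClusterIdentifier="dwhCluster",
        MasterUsername="dwhuser",
        MasterUserPassword="Passw0rd",
        IamRoles=[role_arn],
    )


def delete_cluster():
    """Delete the cluster without keeping a final snapshot."""
    redshift.delete_cluster(
        ClusterIdentifier="dwhCluster",
        SkipFinalClusterSnapshot=True,
    )
```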
Running the pipeline involves two steps:
- Create the Redshift tables, including staging as well as fact and dimension tables:
python create_tables.py
- Run the ETL pipeline:
python etl.py
At the end of the script, sample rows from the staging, fact, and dimension tables are printed as a sanity check, and a couple of tests are run to verify the contents of the tables.
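Conceptually, etl.py first copies the raw JSON files from S3 into staging tables and then fills the fact and dimension tables with INSERT ... SELECT statements. The sketch below illustrates that pattern for the log data and the users table; the S3 prefix, JSONPaths file, and staging column names are assumptions:

```python
# Hypothetical excerpt of the ETL flow: stage the S3 JSON files with COPY,
# then fill an analytics table with INSERT ... SELECT. The S3 paths and
# staging column names below are assumed for illustration.
COPY_EVENTS = """
COPY staging_events
FROM 's3://udacity-dend/log_data'
IAM_ROLE '{role_arn}'
REGION 'us-west-2'
FORMAT AS JSON 's3://udacity-dend/log_json_path.json';
"""

INSERT_USERS = """
INSERT INTO users (user_id, first_name, last_name, gender, level)
SELECT DISTINCT userId, firstName, lastName, gender, level
FROM staging_events
WHERE page = 'NextSong';
"""


def run_etl(cur, conn, role_arn):
    """Stage the log files, then populate one of the dimension tables."""
    cur.execute(COPY_EVENTS.format(role_arn=role_arn))
    cur.execute(INSERT_USERS)
    conn.commit()
```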
After running the pipeline, we can run some analysis on the songplays table.
For example, we can find the top 5 power users with the following query:
WITH top_users AS (
    SELECT user_id, COUNT(*) AS cnt
    FROM songplays
    GROUP BY user_id
    ORDER BY cnt DESC
    LIMIT 5
)
SELECT users.first_name,
       users.last_name,
       top_users.cnt
FROM top_users
INNER JOIN users
    ON users.user_id = top_users.user_id
ORDER BY cnt DESC
and we get:
| first_name | last_name | cnt |
|---|---|---|
| Chloe | Cuevas | 41 |
| Tegan | Levine | 31 |
| Kate | Harrell | 28 |
| Lily | Koch | 20 |
| Aleena | Kirby | 18 |
We can also look for the top 5 most popular locations where songs are played, using the following query:
SELECT location,
       COUNT(*) AS cnt
FROM songplays
GROUP BY location
ORDER BY cnt DESC
LIMIT 5
| location | cnt |
|---|---|
| San Francisco-Oakland-Hayward, CA | 41 |
| Portland-South Portland, ME | 31 |
| Lansing-East Lansing, MI | 28 |
| Chicago-Naperville-Elgin, IL-IN-WI | 20 |
| Atlanta-Sandy Springs-Roswell, GA | 18 |