Project1
Objective
- Do the modeling with the data of the startup company Sparkify in order to be able to analyze the songs what users are listening to.
- By data modeling, tables is created into Postgres database to optimize queries on song play analysis.
- ETL pipeline is created to insert data that was originall log and medatada from the app that was stored as JSON format.
Data modeling with Star schema
The tables are designed as star schema as below so that it is flexible to analyze.
- Fact table: Songplay
- Dimension table: Users, Songs, Artists, Time
DATASETS
- Subset of Million Song Dataset
- Logdata generated by an Event Log Simulator
Files included in the REPO:
- log_data: consists of files in JSON format generated by an event simulator based on the songs in the dataset. These simulate activity logs from a music streaming app, they are partitioned by year and month.
- song_data: consists of files in JSON format includes metadata about a song and the artist of that song. The files are seperated by the first three letters of each song's track ID.
- etl.ipynb which contains detailed steps of the ETL Process used to process the files and insert the data in the tables
- test.ipynb which have some test SELECT statements to show the data being successfully inserted to the table and includes the example queries for the database.
- sql_queries.py that includes the DB create & insert statements separeted in their own file for modularity
- create_tables.py that automatically drops the tables if they already exists and creates the tables as defined in the sql_queries module
- etl.py that contains the main program and manages the file processing needed for reading the files in JSON formats and inserting the data to the DB tables that was defined by the create_tables.py
STEPS TO RUN THE PROJECT
- Execute the "python create_tables.py" file in the Terminal to create all the DB tables.
- Execute the "python etl.py" file in the Terminal to insert all records in the tables.
- Open "test.ipynb" and run each cell from the top. Check the results of each table if the table is created and the data is inserted properly.
EXAMPLE QUERIES(included in example_query.ipynb)
To know how many times each gender had listened to songs.
SELECT u.gender, count(s.songplay_id)
FROM songplays s
JOIN users u
ON s.user_id = u.user_id
GROUP BY u.gender
Each artist's song played frequency
SELECT year, month, artist_name, count(songplay_id) cnt
FROM songplays s
JOIN time t
ON s.start_time = t.start_time
JOIN artists a
ON s.aritst_id = a.aritst_id
GROUP BY t.year, t.month, a.aritst_name
Free and Paid users number
SELECT level, count(*)
FROM songplays GROUP BY level