Project1

Objective

Do the modeling with the data of the startup company Sparkify in order to be able to analyze the songs what users are listening to.
By data modeling, tables is created into Postgres database to optimize queries on song play analysis.
ETL pipeline is created to insert data that was originall log and medatada from the app that was stored as JSON format.

Data modeling with Star schema

The tables are designed as star schema as below so that it is flexible to analyze.

Fact table: Songplay
Dimension table: Users, Songs, Artists, Time

DATASETS

Subset of Million Song Dataset
Logdata generated by an Event Log Simulator

Files included in the REPO:

log_data: consists of files in JSON format generated by an event simulator based on the songs in the dataset. These simulate activity logs from a music streaming app, they are partitioned by year and month.
song_data: consists of files in JSON format includes metadata about a song and the artist of that song. The files are seperated by the first three letters of each song's track ID.

etl.ipynb which contains detailed steps of the ETL Process used to process the files and insert the data in the tables
test.ipynb which have some test SELECT statements to show the data being successfully inserted to the table and includes the example queries for the database.
sql_queries.py that includes the DB create & insert statements separeted in their own file for modularity
create_tables.py that automatically drops the tables if they already exists and creates the tables as defined in the sql_queries module
etl.py that contains the main program and manages the file processing needed for reading the files in JSON formats and inserting the data to the DB tables that was defined by the create_tables.py

STEPS TO RUN THE PROJECT

Execute the "python create_tables.py" file in the Terminal to create all the DB tables.
Execute the "python etl.py" file in the Terminal to insert all records in the tables.
Open "test.ipynb" and run each cell from the top. Check the results of each table if the table is created and the data is inserted properly.

EXAMPLE QUERIES(included in example_query.ipynb)

To know how many times each gender had listened to songs.

SELECT u.gender, count(s.songplay_id)
 FROM songplays s
 JOIN users u
 ON s.user_id = u.user_id
 GROUP BY u.gender

Each artist's song played frequency

SELECT year, month, artist_name, count(songplay_id) cnt
 FROM songplays s
 JOIN time t
 ON s.start_time = t.start_time
 JOIN artists a
 ON s.aritst_id = a.aritst_id
 GROUP BY t.year, t.month, a.aritst_name

Free and Paid users number

 SELECT level, count(*) 
 FROM songplays GROUP BY level

vegapunk-hiroshi/project1_postgres

Project1

Objective

Data modeling with Star schema

DATASETS

Files included in the REPO:

STEPS TO RUN THE PROJECT

EXAMPLE QUERIES(included in example_query.ipynb)

To know how many times each gender had listened to songs.

Each artist's song played frequency

Free and Paid users number