- The data will be modeled using a star schema in PostgreSQL, which consists of one fact table and four dimension tables.
- The star schema is suitable for this analysis as it simplifies queries and enables efficient aggregation of data.
The schema used in this project is as follows (a sketch of the corresponding CREATE TABLE statements follows the listing):
songplays - Records in log data associated with song plays (page: NextSong)
- songplay_id (INT) PRIMARY KEY: ID of each song play
- start_time (TIMESTAMP) NOT NULL: Timestamp of the start of user activity
- user_id (INT) NOT NULL: ID of the user
- level (TEXT): User level (free/paid)
- song_id (TEXT) NOT NULL: ID of the played song
- artist_id (TEXT) NOT NULL: ID of the song's artist
- session_id (INT): ID of the user session
- location (TEXT): User location
- user_agent (TEXT): User agent used to access the Sparkify platform
users - Users in the app
- user_id (INT) PRIMARY KEY: ID of the user
- first_name (TEXT) NOT NULL: First name of the user
- last_name (TEXT) NOT NULL: Last name of the user
- gender (TEXT): Gender of the user (M/F)
- level (TEXT): User level (free/paid)
songs - Songs in music database
- song_id (TEXT) PRIMARY KEY: ID of the song
- title (TEXT) NOT NULL: Title of the song
- artist_id (TEXT) NOT NULL: ID of the song's artist
- year (INT): Year of song release
- duration (FLOAT) NOT NULL: Duration of the song in milliseconds
artists - Artists in music database
- artist_id (TEXT) PRIMARY KEY: ID of the artist
- name (TEXT) NOT NULL: Name of the artist
- location (TEXT): Location of the artist
- latitude (FLOAT): Latitude of the artist's location
- longitude (FLOAT): Longitude of the artist's location
time - Timestamps of records in songplays broken down into specific units
- start_time (TIMESTAMP) PRIMARY KEY: Timestamp of the record
- hour (INT): Hour associated with start_time
- day (INT): Day associated with start_time
- week (INT): Week of year associated with start_time
- month (INT): Month associated with start_time
- year (INT): Year associated with start_time
- weekday (TEXT): Name of the weekday associated with start_time
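The exact DDL is kept in Queries.py and is not reproduced here. As a rough illustration, the statements for the fact table and one dimension table might look like the sketch below; the SERIAL surrogate key and the nullable song_id/artist_id columns are assumptions for this sketch, not taken from the project files.

```python
# Minimal sketch of the DDL assumed to live in Queries.py.
# Column names follow the schema listed above; constraint details may differ.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,  -- SERIAL assumed so the ETL can auto-generate IDs
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       TEXT,
    song_id     TEXT,                -- listed as NOT NULL above; left nullable here since
    artist_id   TEXT,                -- many log events have no matching song in the subset
    session_id  INT,
    location    TEXT,
    user_agent  TEXT
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL,
    gender     TEXT,
    level      TEXT
);
"""
```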
The ETL (Extract, Transform, Load) pipeline is implemented in Python and follows these steps (a condensed sketch of the pipeline follows this list):

- The database and tables are created using the SQL statements provided in Queries.py.
- The song data is processed and loaded into the songs and artists dimension tables.
- The log data is processed and loaded into the time and users dimension tables.
- The songplays fact table is populated by querying the songs and artists tables to obtain the respective IDs based on song information.
- The ETL process is performed for all files in the song and log datasets.
- The ETL pipeline is executed by running the etl.py script.
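The actual implementation lives in etl.py; the following is only a condensed sketch of the steps above, assuming pandas and psycopg2 are used and that the JSON files carry the usual column names (ts, page, userId, song, artist, length, and so on). Those column names, the connection string, and the lookup by title/name/duration are assumptions for this sketch, not taken from the project code.

```python
import glob

import pandas as pd
import psycopg2


def process_song_file(cur, filepath):
    """Load one song JSON file into the songs and artists dimension tables."""
    df = pd.read_json(filepath, lines=True)
    song = df.iloc[0]
    cur.execute(
        "INSERT INTO songs VALUES (%s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
        (song["song_id"], song["title"], song["artist_id"],
         int(song["year"]), float(song["duration"])))
    cur.execute(
        "INSERT INTO artists VALUES (%s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
        (song["artist_id"], song["artist_name"], song["artist_location"],
         song["artist_latitude"], song["artist_longitude"]))


def process_log_file(cur, filepath):
    """Load one log JSON file into the time, users, and songplays tables."""
    df = pd.read_json(filepath, lines=True)
    df = df[df["page"] == "NextSong"].copy()                 # keep song-play events only
    df["start_time"] = pd.to_datetime(df["ts"], unit="ms")   # log timestamps are in milliseconds

    for _, row in df.iterrows():
        t = row["start_time"]
        cur.execute(
            "INSERT INTO time VALUES (%s, %s, %s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
            (t, t.hour, t.day, t.isocalendar()[1], t.month, t.year, t.day_name()))
        cur.execute(
            "INSERT INTO users VALUES (%s, %s, %s, %s, %s) "
            "ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level",
            (int(row["userId"]), row["firstName"], row["lastName"],
             row["gender"], row["level"]))

        # Look up the song and artist IDs that match this log event.
        cur.execute(
            "SELECT s.song_id, s.artist_id FROM songs s "
            "JOIN artists a ON s.artist_id = a.artist_id "
            "WHERE s.title = %s AND a.name = %s AND s.duration = %s",
            (row["song"], row["artist"], row["length"]))
        match = cur.fetchone()
        song_id, artist_id = match if match else (None, None)

        cur.execute(
            "INSERT INTO songplays (start_time, user_id, level, song_id, artist_id, "
            "session_id, location, user_agent) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)",
            (t, int(row["userId"]), row["level"], song_id, artist_id,
             int(row["sessionId"]), row["location"], row["userAgent"]))


def main():
    # Connection parameters are placeholders; adjust them to the local setup.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    conn.autocommit = True
    cur = conn.cursor()
    for path in glob.glob("data/song_data/**/*.json", recursive=True):
        process_song_file(cur, path)
    for path in glob.glob("data/log_data/**/*.json", recursive=True):
        process_log_file(cur, path)
    conn.close()


if __name__ == "__main__":
    main()
```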
The project includes the following files:

- data folder: Contains the JSON files of song and log data.
- Queries.py: Includes all the SQL queries required for creating tables, inserting data, and performing other operations.
- Table_generator.py: Drops and creates the tables defined in the schema, using the queries from Queries.py (a sketch of this pattern follows the list).
- etl.py: The Python script that performs the ETL process for all files in the song and log datasets. It reads the data, transforms it, and loads it into the appropriate tables.
- README.md: The current file, which provides an overview of the project, its structure, and the ETL pipeline.
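A minimal sketch of the drop-and-recreate pattern that Table_generator.py follows is shown below; the list names drop_table_queries and create_table_queries are assumptions about how Queries.py exposes its statements, and the connection string is a placeholder.

```python
import psycopg2

# Assumed to be defined in Queries.py: each entry is a DROP TABLE or
# CREATE TABLE statement for one table of the star schema.
from Queries import drop_table_queries, create_table_queries


def main():
    # The real script may also (re)create the sparkifydb database first;
    # connection parameters here are placeholders for the local setup.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    conn.autocommit = True
    cur = conn.cursor()

    for query in drop_table_queries:    # start from a clean slate
        cur.execute(query)
    for query in create_table_queries:  # rebuild the star schema
        cur.execute(query)

    conn.close()


if __name__ == "__main__":
    main()
```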
- Run Table_generator.py to create the database and tables required for the ETL process.
- Verify the tables by running test.ipynb and checking the first few rows of each table (a quick standalone check is sketched after these steps).
- Execute etl.py to perform the ETL pipeline and populate the tables with the data from the song and log datasets.
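If test.ipynb is not convenient, the same spot check can be done directly from Python; the snippet below simply prints the first rows of every table in the schema (the connection string is a placeholder).

```python
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Print the first few rows of every table in the star schema.
for table in ("songplays", "users", "songs", "artists", "time"):
    cur.execute(f"SELECT * FROM {table} LIMIT 5")
    print(f"--- {table} ---")
    for row in cur.fetchall():
        print(row)

conn.close()
```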
By following these steps, you will successfully create a database schema and perform the ETL process to analyze Sparkify's song and user activity data.