CSCI620 Big Data: Spotify Dataset Project
Project Description
This is an ongoing project worked on for the course "CSCI620: Introduction to Big Data" at Rochester Institute of Technology (Semester code: Spring 2225).
The project consists of three phases and a final presentation, the phases are briefly described below:
- Phase I: Select/create a large dataset, relational model for the dataset
- Phase II: Document-oriented model for the dataset
- Phase III: Datamining
This README file will only contain basic descriptions of each phases, for detailed write up of each phase, refer to the pdf write ups in the docs folder.
Contributors
- Vinod Dalavai | vd1605
- Ramprasad Kokkula | rk1668
- Samson Zhang | sz7651
Phase I
Dataset
Our dataset contains information on one million playlists created by Spotify users. The dataset is sourced from Kaggle, which can be found here.
The complete dataset is stored across a thousand json files, with each file containing a thousand playlists, which totals to a million playlists.
Each of the json files follow the naming pattern of:
mpd.slice.[starting playlist number] - [ending playlist number]
For example, mpd.slice.0-999
contains the first 1,000 playlists, note the use of 0-based numbering.
Relational Model: ER Diagram
Relational Model: Loading Data
The overview of data loading is explained in these steps:
- The json files are read, and a single csv file containing all the data is created.
- Specific columns from the csv file is used to create temporary csv files corresponding to each table of the schema.
- Temporary tables are created, these tables do not have any constraints to make initial data loading easier.
- The content of the temporary csv files is inserted into the corresponding temporary tables.
- The temporary tables are used to create tables that match the actual schema.
- The temporary csv files are deleted and the temporary tables are dropped.