chau0/song_data_pipeline

Overview

This project builds an ETL pipeline that loads data from S3 and writes the output in Parquet format.

Entity-relationship diagram

Dataset

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Source code files

etl.py reads and processes files from song_data and log_data, then writes the results to S3 in Parquet format.
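The extract and transform steps described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the actual contents of etl.py: it assumes the input files hold one JSON record per line, and the column names, the helper functions parse_json_lines and project_unique, and the sample records are all hypothetical. The final write to Parquet on S3 is left out, since it would need a Parquet-capable library.

```python
import json
from typing import Dict, Iterable, List, Tuple

# Hypothetical column names; the real schema is whatever etl.py defines.
SONG_COLUMNS = ("song_id", "title", "artist_id", "year", "duration")
ARTIST_COLUMNS = ("artist_id", "artist_name", "artist_location")


def parse_json_lines(lines: Iterable[str]) -> List[Dict]:
    """Parse one JSON object per non-empty line (assumed input layout)."""
    return [json.loads(line) for line in lines if line.strip()]


def project_unique(records: List[Dict], columns: Tuple[str, ...], key: str) -> List[Dict]:
    """Keep only `columns` and skip rows whose `key` value was already seen."""
    seen = set()
    rows = []
    for record in records:
        if record[key] in seen:
            continue
        seen.add(record[key])
        rows.append({column: record.get(column) for column in columns})
    return rows


# Made-up sample records standing in for files read from song_data.
raw_lines = [
    '{"song_id": "S1", "title": "First", "artist_id": "A1", '
    '"artist_name": "Ann", "artist_location": "Oslo", "year": 2001, "duration": 210.5}',
    '{"song_id": "S2", "title": "Second", "artist_id": "A1", '
    '"artist_name": "Ann", "artist_location": "Oslo", "year": 2003, "duration": 185.0}',
]

records = parse_json_lines(raw_lines)
songs = project_unique(records, SONG_COLUMNS, key="song_id")
artists = project_unique(records, ARTIST_COLUMNS, key="artist_id")
print(len(songs), "song rows,", len(artists), "artist rows")
```

Splitting the raw records into separate song and artist tables before writing keeps each output file narrow and free of repeated artist details, which suits a columnar format such as Parquet.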

How to run the project

Run the ETL process:

python etl.py