- Project charter
- Directory structure
- Setting up environment variables
- Extracting data from Billboard and Spotify
- Executing model pipeline with Makefile
- Execute each step of model pipeline
- Running the application
- Running pipeline and app in Docker
- Backlog
Scenario:
- We are data science consultants; with our proprietary music information retrieval (MIR) platform, we develop data-driven solutions to tackle problems in the music industry
- Rap/Hip-Hop is perhaps the most dynamic and influential music genre today: it challenges social norms, pushes creativity in music production, and revitalizes music across all genres and time periods through sampling
- A record label has approached us to help them expand their Rap/Hip-Hop division
Vision: Because the label receives hundreds of music files every day, they are looking for an automated way to prioritize the review of songs with more Rap/Hip-Hop influences
Mission: Use music attributes (e.g., tempo, valence, duration) to predict the probability that a given song is a rap/hip-hop song
Data sources:
- List of relevant songs and labels using the Billboard Chart API
- Charts to obtain songs and genre labels
- Billboard Hot 100 (Non-Rap/Hip-Hop)
- Billboard Rap Song (Rap/Hip-Hop)
- Span: 2000 to 2020, bi-monthly (1st and 15th of every month)
- Any song that appears in both charts is labeled as Rap/Hip-Hop
- Music attributes from the Spotify API via Spotipy Library
- Attributes: energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms
- Details of these attributes can be found in the Spotify API Documentation
Success criteria
- Model success metric: AUC (area under the ROC curve)
- Business success:
- Ranked list of a given pool of songs, from most to least likely to have Rap/Hip-Hop influences
- Provide insights on the most influential attributes for identifying Rap/Hip-Hop songs
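As a quick refresher on the success metric, AUC-ROC can be computed directly from pairwise rank comparisons. The sketch below is a minimal stdlib illustration, not this project's evaluation code (`auc_roc` is a hypothetical helper name):

```python
def auc_roc(labels, scores):
    """AUC-ROC as the probability that a randomly chosen positive
    (rap/hip-hop) song is scored above a randomly chosen negative
    song. Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A model that ranks both rap songs above both non-rap songs scores a perfect 1.0
print(auc_roc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # → 1.0
```

In practice a library implementation (e.g., scikit-learn's `roc_auc_score`) would be used; this form just makes the ranking interpretation of the metric explicit.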
├── README.md <- You are here
├── app/ <- Directory for application components
│ ├── docker_build.sh <- Bash script for creating application Docker image
│ ├── docker_run.sh <- Bash script for running app Docker container
│ ├── Dockerfile <- Configurations for app Docker image
│ ├── templates/ <- Directory for app templates
│ │ ├── error.html <- Error template when app cannot connect to database
│ │ ├── index.html <- Main application template
│
├── config <- Directory for configuration files
│ ├── flaskconfig.py <- Configuration of application
│ ├── logging.config <- Configuration of python logger
│ ├── pipelineconfig.py <- Configuration of modeling pipeline
│ ├── testconfig.py <- Configuration of pipeline validation
│
├── data <- Folder that contains data used or generated.
│
├── deliverables/ <- Any white papers, presentations, final work products that are presented or delivered to a stakeholder
│
├── figures/ <- Generated graphics and figures to be used in reporting, documentation, etc
│
├── models/ <- Trained model objects (TMOs), model predictions, and/or model summaries
│
├── notebooks/ <- Notebooks used in development
│
├── src/ <- Source code for the project
│ ├── get_data.py <- Functions to extract data from APIs
│ ├── predict_score.py <- Functions to predict probability of a given song
│ ├── train_model.py <- Functions to train/save predictive model and generate model metrics
│ ├── update_db.py <- Functions to create database and save predictions
│
├── test/ <- Files necessary for running model tests
│
├── app.py <- Flask wrapper for application
├── docker_build.sh <- Script to build model pipeline Docker image
├── docker_pipeline.sh <- Script to execute model pipeline in Docker
├── Dockerfile <- Configurations for Docker image
├── env_config <- Template to fill in necessary environment variables
├── Makefile <- Execution of model pipeline
├── requirements.txt <- Python package dependencies
├── run.py <- Script to run each component of the model pipeline and make predictions
Four host environment variables are required for this application. The first two are `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, which are required for downloading the training dataset from S3. The other two are for querying music attributes from the Spotify API; please see the instructions below to set them up. All four variables are automatically applied to the Docker container by the scripts provided in Running pipeline and app in Docker.
Environment variables related to AWS RDS instances are optional. To use an RDS instance, please complete the `env_config` template, then apply the environment variables:

```shell
source env_config
```
Environment variables `SPOTIFY_CID` and `SPOTIFY_SECRET` are required for obtaining data from the Spotify Web API. You must first create a Spotify user account (Premium or Free). Then go to the Dashboard page at the Spotify Developer website and, if necessary, log in. Accept the latest Developer Terms of Service to complete your account setup.

At the Dashboard, you can now create a new Client ID (i.e., a new app). Once you fill in some general information and accept the terms and conditions, you land in the app dashboard, where you can see your Client ID and Client Secret. Set the Client ID as your environment variable `SPOTIFY_CID` and the Client Secret as `SPOTIFY_SECRET`. For screenshots of these directions, please see `/figures`.
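For example, on macOS/Linux the two variables can be set in the current shell session like this (the values below are placeholders, not real credentials):

```shell
# Replace the placeholder values with the Client ID and Client Secret
# shown on your Spotify app dashboard.
export SPOTIFY_CID="your-client-id"
export SPOTIFY_SECRET="your-client-secret"
```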
To extract the data from the Billboard and Spotify APIs, run the following command:

```shell
python3 run.py create_dataset
```

The data will be saved in `data/` and in your designated AWS S3 bucket. Note: it is common for queries to the Billboard API to fail for certain dates, and even more common for the Spotify API to lack music attributes for songs from the Billboard charts.
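One sketch of how those missing attributes might be handled: Spotipy's `audio_features()` returns `None` entries for tracks it cannot resolve, so the extraction step presumably drops them before saving. The helper below is illustrative only (`drop_missing_features` is a hypothetical name; the project's actual logic lives in `src/get_data.py`):

```python
def drop_missing_features(features):
    """Spotipy's audio_features() returns None for tracks it cannot
    find; keep only the songs whose attribute dicts came back."""
    return [f for f in features if f is not None]

# e.g., two Billboard songs queried, one missing from Spotify
batch = [{"tempo": 92.0, "valence": 0.4}, None]
print(drop_missing_features(batch))  # → [{'tempo': 92.0, 'valence': 0.4}]
```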
Once you have extracted and saved the Billboard and Spotify dataset in S3, you can execute the entire model pipeline with the following command:

```shell
make pipeline
```
Perform unit tests with the following command:

```shell
make validate
```
Reset the pipeline (i.e., delete files in `data/` and `models/`):

```shell
make clear
```
Each step of the model pipeline can also be executed individually with `run.py`, which offers more configuration options:
```shell
python3 run.py download_data
python3 run.py train_model
python3 run.py create_db
python3 run.py validate
```
You can make a prediction and save it to your database with the following command:

```shell
python3 run.py predict --search 'Song to predict'
```
- `--engine` or `-e`
  - Specify the use of a MySQL database (requires configuration of AWS RDS credentials in `env_config`). Without this argument, the URI from the environment variable `SQLALCHEMY_DATABASE_URI` is used. Without a `SQLALCHEMY_DATABASE_URI` variable, a SQLite database will be created in the data folder.
  - Applies to `run.py` commands: `create_db`, `validate`, `predict`
- `--uri` or `-u`
  - Specify an engine URI for the database; overrides the `--engine` argument
  - Applies to `run.py` commands: `create_db`, `validate`, `predict`
- `--model` or `-m`
  - Specify the pathname for the model object
  - Applies to `run.py` commands: `train_model`, `predict`
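The fallback order described above could be sketched as follows. This is an illustration of the documented precedence, not `run.py`'s actual implementation; the function name and the SQLite path are placeholders:

```python
import os

def resolve_database_uri(uri_arg=None, mysql_uri=None):
    """Pick a SQLAlchemy engine string per the documented precedence:
    --uri beats --engine, which beats SQLALCHEMY_DATABASE_URI,
    which beats the local SQLite default."""
    if uri_arg:          # --uri / -u: an explicit engine URI wins outright
        return uri_arg
    if mysql_uri:        # --engine / -e: URI built from env_config RDS credentials
        return mysql_uri
    env_uri = os.environ.get("SQLALCHEMY_DATABASE_URI")
    if env_uri:          # environment-variable fallback
        return env_uri
    return "sqlite:///data/songs.db"  # placeholder default path in the data folder
```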
After creating the model and database, you can now run the application:

```shell
python3 app.py
```

The app is accessible at http://0.0.0.0:5000/ in your browser.
During the model pipeline, if you specified a pathname with the `--model` argument, you must also include the same argument when running the app script (e.g., `python3 app.py --model [model path]`).

By default, the app will use the SQLite database. To use another database URI, save it as the environment variable `SQLALCHEMY_DATABASE_URI`.
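For example (the URI below is a placeholder; substitute your own driver, credentials, host, and database name):

```shell
export SQLALCHEMY_DATABASE_URI="mysql+pymysql://user:password@host:3306/dbname"
```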
First, make sure Docker Desktop is running. Then, to build the image, run the following bash code from the root directory:

```shell
docker build -t litness .
```

This command builds the Docker image, tagged `litness`, based on the instructions in `Dockerfile` and the files existing in this directory.
To run the pipeline, execute the following script (note that the `-e` options must precede the image name):

```shell
docker run --mount type=bind,source="$(pwd)",target=/app/ \
    -e SPOTIFY_CID=${SPOTIFY_CID} \
    -e SPOTIFY_SECRET=${SPOTIFY_SECRET} \
    -e AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
    -e AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
    litness pipeline
```
To run the unit tests, execute the following script:

```shell
docker run --mount type=bind,source="$(pwd)",target=/app/ \
    -e SPOTIFY_CID=${SPOTIFY_CID} \
    -e SPOTIFY_SECRET=${SPOTIFY_SECRET} \
    litness validate
```
If you want to specify a particular engine URI for the pipeline, please set the local environment variable `SQLALCHEMY_DATABASE_URI` to the desired URI and add the following argument to both the pipeline and unit test scripts:

```shell
-e SQLALCHEMY_DATABASE_URI=${SQLALCHEMY_DATABASE_URI}
```

The pipeline and unit tests are orchestrated with the `Makefile`. To add additional arguments (e.g., model path), please update the `Makefile` with arguments from Additional arguments.
To build the image, execute the following bash code from the root directory:

```shell
docker build -f app/Dockerfile -t litness .
```

To run the application, execute the following script:

```shell
docker run --mount type=bind,source="$(pwd)",target=/app/ \
    -e SPOTIFY_CID=${SPOTIFY_CID} \
    -e SPOTIFY_SECRET=${SPOTIFY_SECRET} \
    -p 5000:5000 \
    --name test litness app.py
```

If you specified a `SQLALCHEMY_DATABASE_URI` variable, you must add the `-e SQLALCHEMY_DATABASE_URI=${SQLALCHEMY_DATABASE_URI}` argument to the application execution script as well.
Once finished with either the pipeline or the application, you will need to kill the container. To do so:

```shell
docker kill test
```

where `test` is the container name given via `--name` in the `docker run` command.
Outline format:

- Initiative
  - Epic
    - Story (size)

- Gather sufficient data to analyze hip-hop trends
  - Obtain list of popular rap and general mainstream songs over the past 20 years
    - Configure program to pull data from Billboard API (M)
    - Obtain monthly top 25 songs from the Billboard Rap Chart and top 50 songs from the Billboard Hot 100 Chart at the beginning and middle of each year from 1990 to 2019 (S)
  - Fetch audio attributes of songs
    - Configure program to pull data from Spotify API (M)
    - Pull song attributes of songs from Billboard charts (S)
- Identify ideal model and attributes that can best differentiate music from different eras
  - Perform data exploration and cleansing
    - Evaluate audio attribute differences between rap/hip-hop songs and other mainstream music (M)
    - Understand root cause of missing values, balance categories, etc. (M)
  - Model relationship between songs and their attributes
    - Conduct data transformations and feature engineering (L)
    - Explore various model constructs and evaluate model accuracy (L)
- Derive strategic insights for client based on model results
  - Evaluate model metrics
    - Calculate CV model accuracy (S)
    - Calculate CV r-squared (S)
    - Evaluate feature importance (S)
  - Generate interpretations of model results
    - Evaluate differences between current hip-hop songs compared to past songs (M)
    - Develop stakeholder presentation (L)
- Create tool to take in new songs and predict the era the song was created in
  - Bring model into production
    - Create virtual environment with necessary packages (M)
    - As a user, I want to be able to type in a song and have the model predict the probability that the song is of the rap/hip-hop genre (L)
  - Test robustness of model
    - Test edge cases (e.g., Spotify does not have attributes of specific songs) (L)
    - Evaluate model accuracy (M)

Icebox

- Analyze samples used by songs over the years
  - Scrape song sampling data from WhoSampled.com
    - Create program that can retrieve list of samples used by a given song (L)
    - Retrieve list of samples used by every song in the top Billboard list (M)
  - Create model features based on the sampled songs
    - Use Spotify API to fetch song attributes (S)
    - Clean / transform data into usable features (M)
  - Evaluate impact of new features
    - Evaluate model metrics (S)
    - Derive new insights based on model results (M)