Genre Classification

I've chosen to build a genre classifier! Of the project ideas I considered, this one has the most readily available data: 30-second samples scraped from Spotify using the spotipy library.

Problem Statement

Using features derived from an audio source itself, can a classification model predict the genre of a 30-second audio clip with high enough accuracy to organize new songs into their respective genres?

Classifying genre matters for music distribution and streaming platforms: it helps listeners find new bands they might like, which in turn helps musicians connect with new audiences.

My goal is to predict genre using features derived from the audio signal itself. This could help build playlists algorithmically without requiring manual human tagging, and could help artists assign their music to every genre it fits.

The metric for success here is accuracy: since I aim for roughly the same number of songs per genre, the classes are balanced and any incorrect prediction is equally bad.

Data Collection

I scraped 30-second song samples using spotipy, starting with 5 general genres and then digging down into subgenres of each (a scraping sketch appears at the end of this section):

  • Original 5 Genres:
    • Classical
    • Progressive Bluegrass
    • Rock
    • Rap
    • R&B
  • Extra 5 Genres:
    • Tropical House
    • Pop
    • Baroque
    • Serialism
    • Hip Hop

I aimed to have 500 songs from each of the above genres for my classification model, and ended up with 9365 unique songs overall.

Note that after my first round of modeling, I was getting about 84% accuracy on the original 5 genres, and classical alone was being classified correctly about 95% of the time. I included Baroque, Serialism, and Ambient in the additional set of 5 genres to test the suspicion that classical was easier to predict because it was quieter than the other 4 original genres. Baroque and Serialism are subsets of classical music: the Baroque period runs from roughly 1600 to 1750 (there's debate about the blurry start and end, but it's not particularly important here), and Serialism emerged in the 20th century. If my model can still distinguish classical, baroque, and serialism given how closely related these genres are, I'd be very impressed!

Similarly, I tried to include Tropical House and Folk as genres closely related to Rap and Progressive Bluegrass, respectively. Unfortunately, Folk didn't have enough unique songs to meet my threshold of at least 500 unique tracks before sampling.
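The gist of the scraping step, as a minimal sketch (credentials, genre names, and the 500-track target are placeholders, and this assumes spotipy's standard search endpoint with a genre filter rather than my exact script):

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Reads SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET from the environment.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

def collect_previews(genre, target=500):
    """Gather up to `target` tracks for a genre that actually expose a 30-second preview_url."""
    rows, offset = [], 0
    while len(rows) < target and offset < 1000:  # Spotify caps search offsets at 1000
        results = sp.search(q=f'genre:"{genre}"', type='track', limit=50, offset=offset)
        for track in results['tracks']['items']:
            if track['preview_url']:  # not every track has a preview
                rows.append({
                    'id': track['id'],
                    'genre': genre,
                    'track_name': track['name'],
                    'preview_url': track['preview_url'],
                    'artist': track['artists'][0]['name'],
                    'artist_uri': track['artists'][0]['uri'],
                    'album': track['album']['name'],
                    'release_date': track['album']['release_date'],
                })
        offset += 50
    return rows

# Spotify's audio features (danceability, energy, ...) come from a separate
# endpoint, requested in batches of up to 100 track ids:
# features = sp.audio_features([row['id'] for row in rows[:100]])
```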

Data Dictionary

The dataset used for modeling contains features pulled from Spotify's API and features newly extracted from the audio itself:

| Feature Name | Type | Description |
|---|---|---|
| id | string | The unique id for each song (you may see duplicates in the dataset because there are multiple genres per song) |
| genre | string | Genre (seems to be connected to artist) |
| track_name | string | Name of the song |
| preview_url | string | URL where you can listen to a 30-second sample of the song |
| location | string | Relative path of the song |
| filename | string | Filename of the song within the project directory |
| artist | string | Artist on the song |
| artist_uri | string | Artist URI (Spotify's unique id for the artist) |
| album | string | Album containing the song |
| release_date | string | Date when the song was initially released |
| danceability | float | SPOTIFY FEATURE: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | float | SPOTIFY FEATURE: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | int | SPOTIFY FEATURE: The key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. |
| loudness | float | SPOTIFY FEATURE: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | int | SPOTIFY FEATURE: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | float | SPOTIFY FEATURE: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | float | SPOTIFY FEATURE: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | float | SPOTIFY FEATURE: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | float | SPOTIFY FEATURE: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | float | SPOTIFY FEATURE: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | int | SPOTIFY FEATURE: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| energy_calc | array-like object | NEWLY EXTRACTED FEATURE: Total magnitude of the signal at different windows of time. Note this is not the same as Spotify's energy feature. |
| rms_calc | array-like object | NEWLY EXTRACTED FEATURE: Root Mean Squared Energy, a measure of loudness across different windows of time |
| mfcc_calc | array-like object | NEWLY EXTRACTED FEATURE: Multi-step calculation used to create a non-linear representation of a sound's spectral envelope |
| mean_energy | float | NEWLY EXTRACTED FEATURE: Mean of the energy_calc feature |
| std_energy | float | NEWLY EXTRACTED FEATURE: Standard deviation of the energy_calc feature |
| min_energy | float | NEWLY EXTRACTED FEATURE: Minimum of the energy_calc feature |
| max_energy | float | NEWLY EXTRACTED FEATURE: Maximum of the energy_calc feature |
| mfcc_mean | array-like object | NEWLY EXTRACTED FEATURE: Mean of the mfcc_calc feature; still contains an array of filter values |
| mfcc_std | array-like object | NEWLY EXTRACTED FEATURE: Standard deviation of the mfcc_calc feature; still contains an array of filter values |
| mfcc_min | array-like object | NEWLY EXTRACTED FEATURE: Minimum of the mfcc_calc feature; still contains an array of filter values |
| mfcc_max | array-like object | NEWLY EXTRACTED FEATURE: Maximum of the mfcc_calc feature; still contains an array of filter values |
| max_rms | float | NEWLY EXTRACTED FEATURE: Maximum of the rms_calc feature |
| std_rms | float | NEWLY EXTRACTED FEATURE: Standard deviation of the rms_calc feature |
| median_rms | float | NEWLY EXTRACTED FEATURE: Median of the rms_calc feature |
| min_rms | float | NEWLY EXTRACTED FEATURE: Minimum of the rms_calc feature |
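To show how the summary columns relate to the array-valued columns, here is a hedged sketch of the aggregation I'd expect. The function and argument names are mine, not the project's, and I'm reading the "still contains an array" note above as one statistic per MFCC filter computed across time:

```python
import numpy as np

def summarize(mfcc, rms, energy):
    """mfcc: (n_filters, n_frames); rms, energy: (n_frames,) arrays for one 30-second clip."""
    return {
        'mean_energy': energy.mean(), 'std_energy': energy.std(),
        'min_energy': energy.min(),   'max_energy': energy.max(),
        'max_rms': rms.max(), 'std_rms': rms.std(),
        'median_rms': np.median(rms), 'min_rms': rms.min(),
        # The mfcc_* summaries stay array-valued: one statistic per filter,
        # aggregated over the time axis (my assumption from the data dictionary).
        'mfcc_mean': mfcc.mean(axis=1), 'mfcc_std': mfcc.std(axis=1),
        'mfcc_min': mfcc.min(axis=1),   'mfcc_max': mfcc.max(axis=1),
    }
```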

EDA



Methods for cleaning and preprocessing data

Using mostly the librosa library, along with existing techniques gleaned from the music information retrieval community, I extract features from the audio itself (a sketch of this extraction follows the list below). The methods are:

  • Energy and Root Mean Squared Energy (RMSE) - two different measurements of loudness, computed in windows across the 30-second clip
  • Fast Fourier Transform (FFT) - converts a time window of the audio source into a snapshot of the frequency spectrum
  • Mel Frequency Cepstral Coefficients (MFCC) - creates overlapping bins along the log frequency spectrum and stores the power of each of those bins across windows of time
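A minimal sketch of that extraction with librosa, using the frame and hop settings listed in the notes below (the function name, sample rate, and variable names are my assumptions, not the project's code):

```python
import librosa
import numpy as np

FRAME_LENGTH, HOP_LENGTH, N_MFCC = 512, 256, 12

def extract_features(path):
    # Load the 30-second preview at librosa's default 22050 Hz, mono.
    y, sr = librosa.load(path, sr=22050, mono=True)

    # Energy: sum of squared samples per frame (not the same as Spotify's "energy").
    frames = librosa.util.frame(y, frame_length=FRAME_LENGTH, hop_length=HOP_LENGTH)
    energy = (frames ** 2).sum(axis=0)

    # RMS energy per frame -- librosa returns shape (1, n_frames).
    rms = librosa.feature.rms(y=y, frame_length=FRAME_LENGTH, hop_length=HOP_LENGTH)[0]

    # Short-time Fourier transform: a frequency snapshot of each window.
    spec = np.abs(librosa.stft(y, n_fft=FRAME_LENGTH, hop_length=HOP_LENGTH))

    # MFCCs: 12 coefficients per frame, shape (12, n_frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC,
                                n_fft=FRAME_LENGTH, hop_length=HOP_LENGTH)
    return energy, rms, spec, mfcc
```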

Notes for recreating results:

  • When extracting new features, I consistently used the following variables:
    • nfilt (number of filters for mfcc calculation) - 12
    • frame_length (number of samples for each frame/window to make calculations on) - 512
    • hop_length (the number of samples to hop through before making a calculation) - 256
  • When setting up my Convolutional Neural Net, I had to shorten every song sample to 2584 frames, since a few songs had previews longer than 30 seconds (a sketch of this step follows this list).
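The CNN needs fixed-size inputs, so every MFCC matrix was cut to the same number of frames. A minimal sketch of how that could look, assuming the 2584-frame limit above (the helper name is hypothetical, and the zero-padding branch is my addition for shorter clips):

```python
import numpy as np

MAX_FRAMES = 2584  # 1 + (30 * 22050) // 256: frames in a true 30-second clip at hop_length=256

def fit_to_frames(mfcc, max_frames=MAX_FRAMES):
    """Truncate (or zero-pad) an (n_filters, n_frames) matrix to exactly max_frames columns."""
    n_filters, n_frames = mfcc.shape
    if n_frames >= max_frames:
        return mfcc[:, :max_frames]
    padded = np.zeros((n_filters, max_frames), dtype=mfcc.dtype)
    padded[:, :n_frames] = mfcc
    return padded
```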

Models

I tested out Support Vector Machines and Convolutional Neural Networks, since both tend to get good results on audio classification problems; the audio signal can be converted into image-like representations using the preprocessing steps above.[1] Ultimately, I went with an SVC as my production model: it performed better and ran faster than the CNN.
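A hedged sketch of the SVC setup (the synthetic data, hyperparameter grid, and train/test split are illustrative placeholders, not the project's final configuration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder feature matrix and labels; in the project these would be the per-song
# summary statistics (mfcc_*, *_rms, *_energy) and the genre labels.
X, y = make_classification(n_samples=500, n_features=40, n_informative=15,
                           n_classes=5, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# Sweep C to trade bias against variance, then compare train vs. test accuracy
# to keep the overfitting gap within tolerance.
grid = GridSearchCV(pipe, param_grid={'svc__C': [0.1, 1, 10, 100]}, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

print('best C:        ', grid.best_params_['svc__C'])
print('train accuracy:', grid.score(X_train, y_train))
print('test accuracy: ', grid.score(X_test, y_test))
```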

Results

My best performing models were SVCs with the C value tuned to limit overfitting (I tolerated at most about a 5% gap between training and test accuracy).

| Model type | Number of genres | Accuracy (Spotify features) | Accuracy (extracted features) |
|---|---|---|---|
| SVC | 5 | 78.4% | 84.2% |
| SVC | 10 | 50.6% | 56.5% |

Percentage of incorrect classifications per genre using Spotify's features on 5 genres

| Genre | Wrong predictions (%) |
|---|---|
| rap | 31.94 |
| r&b | 26.39 |
| rock | 20.14 |
| progressive bluegrass | 13.19 |
| classical | 8.33 |

Percentage of incorrect classifications per genre using my extracted features on 5 genres

| Genre | Wrong predictions (%) |
|---|---|
| r&b | 33.33 |
| rap | 25.49 |
| progressive bluegrass | 18.62 |
| rock | 16.66 |
| classical | 5.88 |

Percentage of incorrect classifications per genre using Spotify's features on 10 genres

Percentage of incorrect classifications per genre using my extracted features on 10 genres

Extracted Features Confusion Matrix

Conclusions

  • My models using extracted MFCCs outperform models using Spotify's features.
  • When subgenres are present, a false prediction tends to fall within related genres. For example, classical, baroque, and serialism were the most likely to be confused with each other, but all three fit under the umbrella of classical music.
  • Because the model predicts well on more general genres (like the 5-genre model), it could be beneficial for auto-generating recommended metadata for artists. It would still need some human input for further training, since an incorrect prediction here could mean lost revenue for artists and an unpleasant listening experience for users (who might expect one genre but hear another).
  • Further exploration is needed to figure out why the subgenres are being classified together; alternatively, this could be an interesting way to find intersections between genres that are sonically adjacent to each other.
  • If sub-genres are getting confused with each other, are there other factors that go into assigning a sub-genre (where was it written? what are the lyrics about? when was it written?)?

Future Steps

  • Optimize the preprocessing steps (librosa still loads files quite slowly, it seems)
  • Include more genres/subgenres
  • Extract more features from the music: Tempo, pitch content, instrument types?
  • Run some unsupervised learning to see what clusters develop
  • Deploy a web app that will tell you the genre of a song - useful for generating automatic metadata