This small personal project was designed solely to test the performance of some machine learning classification algorithms. To work with an original dataset (I'm growing tired of MNIST, Iris, etc.), I decided to use Spotify's API to get a few features for some songs. Spotify provides a set of numeric metrics that describe relevant characteristics of a track. They are:
- Danceability;
- Energy;
- Key;
- Loudness;
- Mode;
- Speechiness;
- Acousticness;
- Instrumentalness;
- Liveness;
- Valence;
- Tempo.
The meaning of each feature is described in the Spotify Audio Features documentation.
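As a rough illustration of how these eleven metrics can become one row of a dataset, here is a minimal sketch. It assumes the track's audio features arrive as a dictionary keyed by the lowercase feature names; `FEATURE_NAMES`, `to_feature_vector` and the sample values are hypothetical and not taken from this project's code.

```python
# Lowercase names of the features listed above; the order is an arbitrary choice.
FEATURE_NAMES = [
    "danceability", "energy", "key", "loudness", "mode", "speechiness",
    "acousticness", "instrumentalness", "liveness", "valence", "tempo",
]


def to_feature_vector(audio_features):
    """Return the features of one track as a list of floats in FEATURE_NAMES order."""
    missing = [name for name in FEATURE_NAMES if audio_features.get(name) is None]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return [float(audio_features[name]) for name in FEATURE_NAMES]


# Made-up values for illustration only, not a real track.
example = {
    "danceability": 0.55, "energy": 0.86, "key": 5, "loudness": -5.9,
    "mode": 0, "speechiness": 0.06, "acousticness": 0.01,
    "instrumentalness": 0.0, "liveness": 0.21, "valence": 0.4, "tempo": 105.0,
}
vector = to_feature_vector(example)
print(len(vector))
```

Keeping the order in a single list makes it harder to train and predict with the columns in different positions.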
After deciding which features to work with, it was time to gather some data and come up with some labels. Assuming there is a real relationship between this set of features and the genre of the song it describes, I used the API to collect as many songs as possible for a pre-defined set of genres, or categories. The ones I chose were:
- Pop;
- Indie Alt;
- Punk;
- Funk;
- Rock;
- Hip-Hop;
- Metal;
- Country;
- Jazz;
- Reggae;
- Classical;
- Party;
- Latin;
- Romance;
- Blues.
Because the API responses varied across genres, the collected data is not uniform, i.e., some genres have far more instances than others. The resulting dataset is summarized below.
Genre/Category | # of instances |
---|---|
Classical | 2983 |
Latin | 2633 |
Metal | 2238 |
Indie Alt | 1793 |
Romance | 1568 |
Rock | 1503 |
Jazz | 1481 |
Hip-Hop | 1174 |
Party | 1030 |
Funk | 906 |
Country | 890 |
Pop | 865 |
Reggae | 649 |
Blues | 596 |
Punk | 419 |
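To put the imbalance in numbers, the sketch below uses only the counts from the table. It computes the accuracy of always guessing the largest class, plus per-class weights from the common "balanced" heuristic (total samples divided by the number of classes times the class count). Whether the project applied any class weighting is not stated, so the weights are an illustration only.

```python
# Instance counts copied from the table above.
counts = {
    "Classical": 2983, "Latin": 2633, "Metal": 2238, "Indie Alt": 1793,
    "Romance": 1568, "Rock": 1503, "Jazz": 1481, "Hip-Hop": 1174,
    "Party": 1030, "Funk": 906, "Country": 890, "Pop": 865,
    "Reggae": 649, "Blues": 596, "Punk": 419,
}

total = sum(counts.values())

# Accuracy of a classifier that always predicts the most frequent genre.
majority_baseline = max(counts.values()) / total

# "Balanced" heuristic: rarer genres receive proportionally larger weights.
class_weights = {genre: total / (len(counts) * n) for genre, n in counts.items()}

print(total, round(majority_baseline, 3))
print(round(class_weights["Classical"], 2), round(class_weights["Punk"], 2))
```

The majority-class baseline is a useful floor when reading any accuracy figure reported on this dataset.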
The machine learning algorithms selected to attack the aforementioned classification problem were:
The Scikit-Learn library was used for pre-processing the dataset as well as implementing said algorithms.
I compared the cross-validation error of each algorithm and ran a grid search to find the best hyperparameters. The SVM came out best, with an average score of 0.51 (the fraction of correctly classified instances on the validation set). Its learning curve suggests that the model is still underfitting, so it would likely benefit from richer features beyond the current set.
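The kind of search described above can be sketched with scikit-learn as follows. This is not the project's actual code: the pipeline, the parameter grid, the number of folds and the synthetic data are all assumptions chosen to keep the example small, and `tune_svm` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def tune_svm(X, y, n_splits=3):
    """Grid-search an RBF SVM with feature scaling, scored by cross-validated accuracy."""
    # Scaling matters here because features such as tempo and loudness
    # live on very different ranges than the 0-1 metrics.
    pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    # Assumed grid; the values actually searched in the project are not given.
    param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]}
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    return search


# Synthetic stand-in for the real dataset: 11 features, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=6, n_redundant=0,
    n_classes=3, random_state=0,
)
search = tune_svm(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not leak into the preprocessing.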
You can play with the trained model by asking it to predict the genre of a song.
First, sign up for Spotify's API on their website and create an application. This is necessary because an API request is needed to obtain the features of the song you want to test.
Then, go to the api/ folder and fill in the file "client_keys.json" with the keys from the previous step.
All set!
Run main.py from the command line with the name of your song as an argument. For instance:
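For context, this is roughly how such keys are typically exchanged for an access token using Spotify's client credentials flow. It is only a sketch: the JSON key names `client_id` and `client_secret` and the helpers `load_keys` and `build_token_request` are assumptions, and the project's own code may work differently.

```python
import base64
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://accounts.spotify.com/api/token"


def load_keys(path="api/client_keys.json"):
    """Read the keys file; its exact layout is assumed, not known."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def build_token_request(client_id, client_secret):
    """Build (without sending) the POST request for a client-credentials token."""
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode("ascii")
    return urllib.request.Request(TOKEN_URL, data=body, headers=headers, method="POST")


# Usage, assuming the hypothetical key names above:
# keys = load_keys()
# request = build_token_request(keys["client_id"], keys["client_secret"])
# with urllib.request.urlopen(request) as response:
#     token = json.load(response)["access_token"]
```

The request is built separately from sending it so that the credentials handling can be checked without network access.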
$ python main.py "in the end"