This project was made as part of the Machine Learning course at IIIT-Delhi. Link to blog
We use the GTZAN dataset, which contains a total of 1000 audio files in .wav format divided into 10 genres, with 100 songs of 30-second duration per genre. Along with the audio files, the dataset provides 2 CSV files containing features of the audio files: one holds the mean and variance calculated over multiple features that can be obtained from each 30-second file, and the other has the same composition but with each song split into 3-second clips.
We convert every audio file to a signal with a given sampling rate to analyze its characteristics. Every waveform has its features in two forms:
- Time domain: apart from visual differences between the waveforms, not much information about music quality can be extracted and explored here.
- Frequency domain, obtained after a Fourier transform, yielding two types of features: spectral features and rhythm features.
MFCC and rhythm feature plots provide matrix-based representations of these features; both are mapped against the duration of the music file.
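A minimal sketch of this kind of per-file feature extraction, assuming librosa and a local copy of the GTZAN .wav files (the file path and column names here are illustrative):

```python
import librosa

# Load one GTZAN clip as a time-domain signal; sr=22050 is librosa's default sampling rate.
y, sr = librosa.load("genres/blues/blues.00000.wav", sr=22050)
print("duration (s):", len(y) / sr)

# Spectral features (MFCCs, chroma, spectral centroid) and a simple rhythm feature (tempo).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)        # matrix: 20 coefficients x frames
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)

# The GTZAN CSVs store the mean and variance of such feature matrices per file.
row = {
    "mfcc1_mean": mfcc[0].mean(), "mfcc1_var": mfcc[0].var(),
    "chroma_stft_mean": chroma.mean(), "spectral_centroid_mean": centroid.mean(),
    "tempo": tempo,
}
```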
After feature extraction, no columns contained null values, so no extra values had to be imputed. Why is it important to preprocess the data?
- The variables are transformed to the same scale.
- So that all continuous variables contribute equally.
- So that the results are not biased.
- PCA is very sensitive to the variances of the initial variables: if features have very different ranges, the one with the larger range will dominate.
- The boxplots of each feature show that some features have very large differences in their variances.
- PCA with both normalisation (MinMaxScaler) and standardisation (StandardScaler) was performed and the difference noted.
Feature extraction -> correlation matrix -> PCA
- With 30-second samples
- With 3-second samples
- Fewer outliers and lower variance were found for some classes in the principal components:
- pca.explained_variance_ratio_ = [0.20054986, 0.13542712] shows that PC1 explains about 20% of the variance and PC2 about 13%.
- Large clusters of metal, rock, pop, reggae and classical can be seen.
- Jazz and country are separable to some extent.
- Hip-hop, disco and blues are very dispersed and cannot be clearly distinguished.
- The majority of classes are easily separable.
- We decided to proceed to the modelling phase with the 3-second sampled feature set and standardisation, as it aggregated the genres into more linearly separable clusters than normalisation (a minimal sketch follows below).
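A minimal sketch of the scaling comparison and the 2-component PCA described above, assuming the public 3-second GTZAN feature CSV; the file name and dropped columns are assumptions:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

df = pd.read_csv("features_3_sec.csv")                 # 3-second feature file
X = df.drop(columns=["filename", "label"])             # keep only numeric features
y = df["label"]

# Compare normalisation vs standardisation via a 2-component PCA.
for scaler in (MinMaxScaler(), StandardScaler()):
    pca = PCA(n_components=2)
    pca.fit(scaler.fit_transform(X))
    print(type(scaler).__name__, pca.explained_variance_ratio_)

# Proceed to modelling with the standardised 3-second feature set.
X_std = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_std, y, test_size=0.3, random_state=0, stratify=y)
```

The later model sketches reuse this X_train / X_test / y_train / y_test split.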
The first classification model is a predictive algorithm based on the concept of probability. GridSearchCV was used to pass every combination of hyperparameters into the model and select the best parameters.
Without Hyperparameter tuning:
Metric | Value |
---|---|
Accuracy score | 0.67267 |
Precision | 0.74126 |
Recall | 0.74098 |
Using Hyperparameter tuning:
Metric | Value |
---|---|
Accuracy score | 0.70504 |
Precision | 0.70324 |
Recall | 0.71873 |
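The text does not name this classifier explicitly; assuming a logistic-regression-style probabilistic model from scikit-learn, the GridSearchCV tuning could look like the sketch below (the parameter grid is illustrative, and the train/test split comes from the preprocessing sketch above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score

param_grid = {"C": [0.1, 1.0, 10.0], "solver": ["lbfgs", "saga"]}   # illustrative grid
grid = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
print(grid.best_params_)
print(accuracy_score(y_test, y_pred),
      precision_score(y_test, y_pred, average="macro"),
      recall_score(y_test, y_pred, average="macro"))
```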
We took SGD as the baseline model and performed hyperparameter tuning for better performance, though the differences were not that great even after HP tuning.
Without Hyperparameter tuning:
Metric | Value |
---|---|
Accuracy score | 0.6126126126126126 |
Precision | 0.6142479131341332 |
Recall | 0.6172558275062101 |
With Hyperparameter tuning:
Metric | Value |
---|---|
Accuracy score | 0.6441441441441441 |
Precision | 0.6386137102787109 |
Recall | 0.6421140902032518 |
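A sketch of the SGD baseline and its tuning, assuming scikit-learn's SGDClassifier and reusing the split above; the parameter grid is illustrative:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Baseline: a linear classifier trained with stochastic gradient descent.
baseline = SGDClassifier(random_state=0).fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))

# Hyperparameter tuning over the loss and regularisation strength.
param_grid = {"loss": ["hinge", "modified_huber"], "alpha": [1e-4, 1e-3, 1e-2]}
grid = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```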
We used a simple Naive Bayes classifier and a One-vs-Rest Naive Bayes classifier as baseline models, then used hyperparameter tuning to get better performance.
Without Hyperparameter tuning:
Metric | Value |
---|---|
Accuracy score | 0.48598598598598597 |
Precision | 0.4761542269197442 |
Recall | 0.4902979078811803 |
With Hyperparameter tuning:
Metric | Value |
---|---|
Accuracy score | 0.5155155155155156 |
Precision | 0.49864157768533374 |
Recall | 0.5050696700999591 |
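A sketch of the two Naive Bayes baselines and the tuning step, assuming GaussianNB and a One-vs-Rest wrapper from scikit-learn; the var_smoothing grid is illustrative:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GridSearchCV

# Baselines: plain Gaussian Naive Bayes and a One-vs-Rest variant.
nb = GaussianNB().fit(X_train, y_train)
ovr_nb = OneVsRestClassifier(GaussianNB()).fit(X_train, y_train)
print(nb.score(X_test, y_test), ovr_nb.score(X_test, y_test))

# GaussianNB exposes little to tune beyond var_smoothing.
grid = GridSearchCV(GaussianNB(), {"var_smoothing": np.logspace(-9, -1, 9)}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```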
The KNN model clearly outperformed the Gaussian NB models. As we can see, after HP tuning the correlation between the features decreased; some even had zero correlation.
Without Hyperparameter tuning:
Metric | Value |
---|---|
Accuracy score | 0.8603603603603603 |
Precision | 0.8594536380364758 |
Recall | 0.8583135066852872 |
Using Hyperparameter tuning:
Metric | Value |
---|---|
Accuracy score | 0.9059059059059059 |
Precision | 0.9073617032054686 |
Recall | 0.905944266718195 |
Best params: {'metric': 'manhattan', 'n_neighbors': 1, 'weights': 'uniform'}
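A sketch of the KNN grid search that produced the best parameters reported above; the candidate grid is illustrative:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "metric": ["euclidean", "manhattan"],
    "weights": ["uniform", "distance"],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)   # reported best: {'metric': 'manhattan', 'n_neighbors': 1, 'weights': 'uniform'}
print(grid.score(X_test, y_test))
```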
- We took a Decision Tree (DT) as the baseline model, which did not give great results, with accuracy around 64%.
Metric | Value |
---|---|
Accuracy score | 0.637758505670447 |
Precision | 0.6396387192624916 |
Recall | 0.6376582879474517 |
- AdaBoost was then used, which reduced the performance, particularly for rock, pop and disco.
Metric | Value |
---|---|
Best parameters | n_estimators=100 |
Accuracy score | 0.5010006671114076 |
Precision | 0.48730102839842837 |
Recall | 0.4992406459587978 |
- Gradient boosting was then used, which increased the accuracy substantially.
Metric | Value |
---|---|
Best parameters | n_estimators=100 |
Accuracy score | 0.8238825883922615 |
Precision | 0.8266806080093154 |
Recall | 0.8232200760446549 |
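A sketch of the tree-based comparison above (Decision Tree baseline, AdaBoost, gradient boosting) with n_estimators=100 as reported; other settings are scikit-learn defaults:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```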
- CatBoost had a high AUC for all genres, unlike gradient boosting, which had low accuracy for some genres.
- CatBoost outperformed the other ensemble methods. Gradient boosting was close with 82% accuracy; the rest were in the 50-60% range.
Metric | Value |
---|---|
Best parameters | loss_function='MultiClass' |
Accuracy score | 0.8972648432288192 |
Precision | 0.8979267969111706 |
Recall | 0.8972734276109252 |
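A sketch of the CatBoost classifier with the MultiClass loss reported above, assuming the catboost package; the iteration count is an illustrative default, not the tuned value:

```python
from catboost import CatBoostClassifier

cat = CatBoostClassifier(loss_function="MultiClass",
                         iterations=500,      # illustrative, not the tuned value
                         random_seed=0,
                         verbose=0)
cat.fit(X_train, y_train)
print(cat.score(X_test, y_test))
```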
- As shown here, Random Forest (RF) had around 80% accuracy, but XGBoost on Random Forest reduced the accuracy to 75%.
Metric | Value |
---|---|
Best parameters | n_estimators=1000, max_depth=10 |
Accuracy score | 0.8038692461641094 |
Precision | 0.805947955999254 |
Recall | 0.8026467091527609 |
- Cross gradient boosting on Random Forest reduced the accuracy; it even reduced precision and recall to a large extent.
Metric | Value |
---|---|
Best parameters | objective= 'multi:softmax' |
Accuracy score | 0.7505003335557038 |
Precision | 0.7593347049139745 |
Recall | 0.7494976488750396 |
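A sketch contrasting the Random Forest and the gradient-boosted random forest above, assuming the latter refers to xgboost's XGBRFClassifier with objective='multi:softmax'; labels are integer-encoded because xgboost requires numeric classes:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRFClassifier

rf = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=0, n_jobs=-1)
rf.fit(X_train, y_train)
print("RandomForest:", rf.score(X_test, y_test))

# xgboost expects integer class labels for multi:softmax.
le = LabelEncoder().fit(y_train)
xgb_rf = XGBRFClassifier(objective="multi:softmax", n_estimators=100, random_state=0)
xgb_rf.fit(X_train, le.transform(y_train))
print("XGBRF:", xgb_rf.score(X_test, le.transform(y_test)))
```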
- The correlation matrix shows very little correlation among the variables.
- This was the best performing model among all DT and RF based models; every genre was classified with at least 85% accuracy.
- Genres like classical and hip-hop even reached 100% accuracy.
- XGBoost improves upon the basic Gradient Boosting Method framework through systems optimization and algorithmic enhancements.
- Evaluations
Metric | Value |
---|---|
Best parameters | learning_rate=0.05, n_estimators=1000 |
Accuracy score | 0.9072715143428952 |
Precision | 0.9080431364823143 |
Recall | 0.9072401472896423 |
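A sketch of the XGBoost classifier with the best parameters reported above (learning_rate=0.05, n_estimators=1000); labels are again integer-encoded for xgboost:

```python
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

le = LabelEncoder().fit(y_train)
xgb = XGBClassifier(learning_rate=0.05, n_estimators=1000,
                    objective="multi:softmax", random_state=0)
xgb.fit(X_train, le.transform(y_train))
print(xgb.score(X_test, le.transform(y_test)))
```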
This model is an Artificial Neural Network involving multiple layers, each with a considerable number of neurons. The initial training used random values for the hyperparameters, except for the activation function. This initial run showed overfitting on the data for the different activation functions:
Activation | Training Accuracy | Testing Accuracy |
---|---|---|
relu | 0.9887142777442932 | 0.5206666588783264 |
sigmoid | 0.941428542137146 | 0.4970000088214874 |
tanh | 0.9997143149375916 | 0.49266666173934937 |
softplus | 0.9991428852081299 | 0.5583333373069763 |
From the following graph, we choose softplus as the best activation function, with softmax fixed for the output layer.
Looking at the graph, we see a very large gap between training and testing accuracy, so the model is overfitting. In fact, the testing loss starts to increase, which indicates high cross-entropy loss; this will be dealt with later. For now, softplus, relu and sigmoid all perform similarly on the training and testing sets, so we go with softplus since it gives slightly less variance than the others.
- Learning rate
activation = softmax
no. of hidden layers = 3; neurons in each = [512,256,64]
activation of output layer is fixed to be softmax; epochs = 100
Learning Rate | Training Accuracy | Testing Accuracy |
---|---|---|
0.01 | 0.4044285714626312 | 0.335999995470047 |
0.001 | 0.9888571500778198 | 0.5666666626930237 |
0.0001 | 0.9684285521507263 | 0.5513333082199097 |
0.00001 | 0.7134285569190979 | 0.4996666610240936 |
From the above graphs, we see that 0.01 definitely overshoots, and the accuracy bounces around as the accuracy graph reflects. 0.001 has a very high variance and the loss increases marginally with low accuracy, so it isn't appropriate either.
The best choice for alpha is either 0.0001 or 0.00001.
0.00001 has a relatively low variance and the loss converges quickly over the epochs, but accuracy on the training and testing sets is quite low.
0.0001 has better performance, but the variance is very high.
- Number of hidden layers
activation = softmax
learning rate = 0.0001
activation of output layer is fixed to be softmax; epochs = 100
Number of layers | Training Accuracy | Testing Accuracy |
---|---|---|
2 | 0.9782857298851013 | 0.5383333563804626 |
3 | 0.9869999885559082 | 0.5443333387374878 |
4 | 0.9921428561210632 | 0.5506666898727417 |
In conclusion, increasing or decreasing the number of layers has no effect on the variance. This is because we have too many neurons per layer, so we take 3 layers and reduce the number of neurons.
- Number of neurons
activation = softmax
learning rate = 0.0001
number of layers = 3
activation of output layer is fixed to be softmax; epochs = 100
drop out probability = 0.3
alpha = 0.001
Number of neurons | Training Accuracy | Testing Accuracy |
---|---|---|
[512, 256, 128] | 0.9984285831451416 | 0.563666641712188 |
[256, 128, 64] | 0.915142834186554 | 0.5149999856948853 |
[180, 90, 30] | 0.7991428375244141 | 0.503000020980835 |
[128, 64, 32] | 0.6991428732872009 | 0.4900000095367431 |
Now, for the same neuron configurations, we apply regularization and neuron dropout to look for any change in the variance at a high number of neurons as we reduce the number of neurons.
- Regularization and dropout
Number of neurons | Training Accuracy | Testing Accuracy |
---|---|---|
[512, 256, 128] | 0.6759999990463257 | 0.5830000042915344 |
[256, 128, 64] | 0.5278571248054504 | 0.5189999938011169 |
[180, 90, 30] | 0.43642857670783997 | 0.4629999995231628 |
[128, 64, 32] | 0.386428564786911 | 0.4203333258628845 |
So, in conclusion, if we have a high number of neurons per layer, then applying regularization techniques will increase the accuracy and decrease the variance overall. If we do not apply any regularization techniques, then a moderate number of neurons gives decent accuracy on the training and testing sets with lower variance.
From all our analysis and further experimentation, we finalise our model with the following settings:
- activation : softmax
- learning rate : 0.0001
- number of hidden layers = 3
- number of neurons in each layer = [512,256,128]
- epochs = 100
- regularization and dropout enabled
Precision on the model : 0.5774000692307671
Recall on the model : 0.583
F1score on the model : 0.5801865223684216
Accuracy on the model : 0.6130000042915345
Even after hyperparameter tuning, the best accuracy is just above 60%. The reason is overfitting: the network fits the training set extremely well but consistently underperforms on the testing set because it fails to generalise from the features.
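A minimal Keras sketch of the final MLP configuration listed above (hidden layers [512, 256, 128], dropout 0.3, L2 regularisation with alpha = 0.001, learning rate 0.0001, 100 epochs). The hidden-layer activation is kept as softmax because that is what the configuration states; the Adam optimiser and integer-encoded labels are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_mlp(n_features, n_classes=10):
    model = tf.keras.Sequential([layers.Input(shape=(n_features,))])
    # Three hidden layers with dropout and L2 regularisation, as listed above.
    for units in [512, 256, 128]:
        model.add(layers.Dense(units, activation="softmax",     # activation as listed above
                               kernel_regularizer=regularizers.l2(0.001)))
        model.add(layers.Dropout(0.3))
    model.add(layers.Dense(n_classes, activation="softmax"))    # softmax output layer
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage (y_train_enc / y_test_enc are assumed integer-encoded genre labels):
# model = build_mlp(X_train.shape[1])
# model.fit(X_train, y_train_enc, epochs=100, validation_data=(X_test, y_test_enc))
```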
The SVM model outperformed every other model and gave the best accuracy. Manual hyperparameter tuning was done, and the linear, polynomial and RBF kernels were compared using confusion matrices.
Metric | Value |
---|---|
Best parameters | C=1.0,kernel='linear',random_state=0 |
Accuracy score | 0.70672342343265456 |
Precision | 0.7180431364823143 |
Recall | 0.71234655872896242 |
Metric | Value |
---|---|
Best parameters | C=1.0,kernel='poly',degree=7 |
Accuracy score | 0.88242715143428952 |
Precision | 0.8780431364823143 |
Recall | 0.87035601472896557 |
Metric | Value |
---|---|
Best parameters | C=200,kernel='rbf',gamma=4 |
Accuracy score | 0.9424715143428952 |
Precision | 0.939297323879391 |
Recall | 0.9372401472896423 |
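A sketch of the three SVM kernels compared above, using the reported parameters; confusion matrices are computed for the kernel comparison as described:

```python
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

kernels = {
    "linear": SVC(C=1.0, kernel="linear", random_state=0),
    "poly":   SVC(C=1.0, kernel="poly", degree=7),
    "rbf":    SVC(C=200, kernel="rbf", gamma=4),
}
for name, svm in kernels.items():
    svm.fit(X_train, y_train)
    print(name, svm.score(X_test, y_test))
    print(confusion_matrix(y_test, svm.predict(X_test)))
```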
- SVMs performed the best among all classifiers with 94% accuracy
- The Gaussian (RBF) kernel outperformed the polynomial kernel in almost all iterations.
- XGB classifiers were the best among all ensemble methods, with 90% accuracy.
- Since the genre classes were balanced, little tradeoff between precision and recall was observed.
- Among the KNN, DT and ensemble classifiers, precision was higher than recall.
- For LR, SGD, NB, MLP and SVM, recall was higher than precision.