Recognizing music genre is a challenging task in the area of music information retrieval. Two approaches are studied here:
- Spectrogram-based end-to-end image classification using a CNN (VGG-16)
- Feature-engineering approach using Logistic Regression, SVM, Random Forest, and eXtreme Gradient Boosting (XGBoost)
For a detailed description of the project, please refer to Music Genre Classification using Machine Learning Techniques, published on arXiv.
The AudioSet data released by Google is used in this study. Specifically, only the WAV files that correspond to the following class labels are extracted from YouTube, based on the video link and the start and end times.
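Conceptually, retrieving one clip means resolving the YouTube audio stream and trimming it to the annotated `[start, end]` window. The sketch below only builds the `youtube-dl` and `ffmpeg` command strings for one hypothetical clip; it is an illustration of the idea, not the project's `audio_retrieval.py` (the video ID, sample rate, and output name are placeholder assumptions).

```python
import shlex

def build_clip_commands(video_id, start_s, end_s, out_wav):
    """Build shell commands to fetch and trim one AudioSet clip (sketch).

    youtube-dl resolves the direct audio stream URL; ffmpeg then extracts
    only the [start_s, end_s] window as mono WAV. All concrete values here
    (sample rate, output name) are illustrative assumptions.
    """
    url = "https://www.youtube.com/watch?v=" + video_id
    get_url = "youtube-dl -f bestaudio --get-url " + shlex.quote(url)
    trim = (
        f'ffmpeg -ss {start_s} -to {end_s} -i "$AUDIO_URL" '
        f"-ar 22050 -ac 1 {shlex.quote(out_wav)}"
    )
    return get_url, trim

# Placeholder clip metadata, not a real AudioSet entry.
get_url, trim = build_clip_commands("EXAMPLE_ID", 30.0, 40.0, "clip.wav")
print(get_url)
print(trim)
```

Trimming on the fly this way avoids storing full-length videos; only the 10-second segments end up on disk.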
- tensorflow-gpu==1.3.0
- Keras==2.0.8
- numpy==1.12.1
- pandas==0.22.0
- youtube-dl==2018.2.4
- scipy==0.19.0
- librosa==0.5.1
- tqdm==4.19.1
- Pillow==4.1.1
Note: If you encounter any problem installing these modules, you can download prebuilt packages matching your Python version from the unofficial Python binaries page.
- First, the audio WAV files need to be downloaded using the tool youtube-dl. For this, run `audio_retrieval.py`. Note that each file is about 880 KB, adding up to roughly 34 GB in total!
- Next, generate MEL spectrograms by running `generate_spectrograms.py`. If needed, you can modify this file to change the Short-Time Fourier Transform (STFT) parameters.
- The next step is to run the models. Please refer to the corresponding Jupyter notebooks. The deep-learning-based models are in notebooks 3.1, 3.2, and 3.3. Notebooks 4 and 5 contain the steps for feature extraction (run `feature_extraction.py`) and for building the classifiers using `sklearn`.
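For reference, the mel-spectrogram step can be sketched in plain NumPy as STFT → power → mel filter bank → dB. This is only an illustration of what `generate_spectrograms.py` conceptually does (the project uses librosa); all parameter values below (`n_fft`, `hop_length`, `n_mels`) are illustrative defaults, not necessarily the project's settings.

```python
import numpy as np

def mel_spectrogram_sketch(y, sr, n_fft=2048, hop_length=512, n_mels=96):
    """Illustrative mel spectrogram: STFT -> power -> mel filters -> dB."""
    # Short-Time Fourier Transform via framed, Hann-windowed FFTs.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack([y[i * hop_length:i * hop_length + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (frames, n_fft//2+1)

    # Triangular mel filter bank spanning 0 Hz .. sr/2.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)

    mel = power @ fb.T                                  # (frames, n_mels)
    return 10.0 * np.log10(np.maximum(mel, 1e-10)).T    # dB, (n_mels, frames)

# Example: a 1-second 440 Hz tone at 22050 Hz.
sr = 22050
t = np.arange(sr) / sr
spec = mel_spectrogram_sketch(np.sin(2 * np.pi * 440.0 * t), sr)
print(spec.shape)
```

The resulting (mel bands × frames) matrix is what gets saved as an image and fed to the CNN.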
The models are evaluated on the basis of AUC, accuracy, and F-score.
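To make the metric definitions explicit, accuracy and F-score can be computed directly from the confusion counts; AUC additionally requires ranked prediction scores (in practice, `sklearn.metrics.roc_auc_score`). The snippet below is a minimal binary-case sketch, not the notebooks' code, which uses sklearn's implementations.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy and F1 from hard binary predictions (illustrative sketch)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1

acc, f1 = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
print(round(acc, 3), round(f1, 3))  # → 0.667 0.667
```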
The 20 most important features according to the XGB classifier are shown below. The metric on the x-axis is the number of times a given feature appears as a decision node across all of the decision trees used to build the gradient-boosted predictor.
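That count corresponds to XGBoost's "weight" importance type, obtainable via `booster.get_score(importance_type='weight')`. As a toy illustration of how it is derived, the sketch below counts feature occurrences in text tree dumps of the form produced by `Booster.get_dump()`; the two dumps here are made-up examples, not real model output.

```python
import re
from collections import Counter

def weight_importance(tree_dumps):
    """Count how often each feature appears as a split node across trees.

    Mirrors XGBoost's 'weight' importance type. Split nodes in a text dump
    look like '0:[f12<0.5] yes=1,no=2'; leaves have no bracketed feature.
    """
    counts = Counter()
    for dump in tree_dumps:
        counts.update(re.findall(r"\[(f\d+)<", dump))
    return counts

# Two tiny made-up tree dumps: f3 splits twice, f7 once.
dumps = [
    "0:[f3<0.5] yes=1,no=2\n\t1:leaf=0.1\n\t2:[f7<1.2] yes=3,no=4",
    "0:[f3<0.9] yes=1,no=2\n\t1:leaf=-0.2\n\t2:leaf=0.3",
]
print(weight_importance(dumps).most_common())  # → [('f3', 2), ('f7', 1)]
```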
The confusion matrix of the ensemble of the XGB and CNN classifiers:
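One common way to ensemble two probabilistic classifiers is to average their predicted class probabilities, take the argmax, and tally a confusion matrix. The sketch below assumes equal weighting of the two models, which may differ from the exact scheme used in the notebooks; the probability values are made-up examples.

```python
import numpy as np

def ensemble_confusion(p_xgb, p_cnn, y_true, n_classes):
    """Average two probability matrices, predict by argmax, tally confusion.

    Rows of the returned matrix are true classes, columns are predictions.
    Equal model weighting is an illustrative assumption.
    """
    p_avg = (np.asarray(p_xgb) + np.asarray(p_cnn)) / 2.0
    y_pred = p_avg.argmax(axis=1)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up per-class probabilities for three samples, two classes.
p_xgb = [[0.7, 0.3], [0.2, 0.8], [0.6, 0.4]]
p_cnn = [[0.6, 0.4], [0.4, 0.6], [0.3, 0.7]]
cm = ensemble_confusion(p_xgb, p_cnn, [0, 1, 1], n_classes=2)
print(cm)  # → [[1 0]
            #    [0 2]]
```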