music-genre-classification

This is the final project for the course CSCI-SHU 360-001 Machine Learning (Spring 2020), completed by Ren Sheng and me.

Inspired by our own difficulty in recognizing music genres, the project aims to classify music into genres such as Blues, Rock, and Jazz.

For data preprocessing, we use MFCC to convert audio into a feature matrix. MFCC is short for mel-frequency cepstral coefficients, which are derived from a cepstral representation of the audio clip. Each musical segment can then be represented by a handful of coefficients, in other words, an array. Since a song consists of many segments, a whole song becomes a matrix of floating-point numbers. MFCC is widely used in speech recognition and in music information retrieval. In our dataset, each musical segment is represented by 12 MFCC features.

We use two kinds of data from the Million Song Dataset. The first is the audio feature dataset, which contains feature matrices and metadata for one million songs. Because the full dataset is too large, we use only its 10,000-song subset. Each song in it comes with several feature matrices of different shapes. The second is the genre dataset. We obtained two genre subsets from the website, each consisting of pairs of music tracks and their corresponding genres.

We first take each feature matrix and its track_id from the feature dataset, and then look up the genre for that track_id in the genre dataset.
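The lookup described above amounts to a join on track_id. The sketch below is only an illustration on toy data: the dictionaries, the track ids, and the helper name join_by_track_id are hypothetical and are not taken from the project code.

```python
# Hypothetical toy data: track_id -> feature matrix (rows are segments, 12 columns).
features = {
    "TR0001": [[0.0] * 12 for _ in range(3)],
    "TR0002": [[1.0] * 12 for _ in range(5)],
    "TR0003": [[2.0] * 12 for _ in range(4)],
}

# Hypothetical genre table: track_id -> genre label.
genres = {"TR0002": "Rock", "TR0003": "Jazz", "TR0004": "Blues"}


def join_by_track_id(features, genres):
    """Keep only tracks present in both tables and pair features with genre."""
    shared = sorted(features.keys() & genres.keys())
    return [(tid, features[tid], genres[tid]) for tid in shared]


labeled = join_by_track_id(features, genres)
for tid, matrix, genre in labeled:
    print(tid, len(matrix), "segments ->", genre)
```

Tracks that appear in only one of the two tables are dropped, which is why the joined set is smaller than either input.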

During preprocessing, we first took the intersection of the feature dataset and the genre datasets, which contains 4,844 songs. We then normalized the genres across the two genre subsets, which means filtering out labels that appear in only one of them. This left 15 labels and 3,396 songs. After normalizing the genres, we extracted the feature matrix we analyze, stored as segments_timbre under analysis. Its shape is m by 12, where m depends on the length of the song. To normalize the feature matrices, we fixed every one to 400 by 12, after which 3,190 songs remained. Finally, we split them into three sets: 1,500 for training, 190 for validation, and 1,500 for testing.
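The length normalization and the split can be sketched as follows. This is a minimal sketch under stated assumptions: the text does not say how matrices were brought to 400 rows, so keeping the first 400 segments and dropping shorter songs is an assumed policy, and the helper names fix_length and split_indices, the random seed, and the synthetic data are all hypothetical.

```python
import numpy as np

TARGET_LEN = 400  # segments kept per song, as stated in the text


def fix_length(timbre, target_len=TARGET_LEN):
    """Return the first target_len rows, or None if the song is too short.

    Assumed policy: the README does not say whether songs were truncated,
    padded, or dropped.
    """
    timbre = np.asarray(timbre, dtype=np.float32)
    if timbre.shape[0] < target_len:
        return None
    return timbre[:target_len]


def split_indices(n, n_train, n_val, seed=0):
    """Shuffle range(n) and cut it into train / validation / test index arrays."""
    order = np.random.default_rng(seed).permutation(n)
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


# Synthetic stand-ins for m-by-12 timbre matrices of different lengths.
rng = np.random.default_rng(0)
songs = [rng.normal(size=(m, 12)) for m in (350, 400, 620, 900)]

fixed = [f for f in (fix_length(s) for s in songs) if f is not None]
X = np.stack(fixed)  # shape (3, 400, 12): the 350-segment song was dropped

# Split sizes from the text: 1,500 / 190 / 1,500 out of 3,190 songs.
train_idx, val_idx, test_idx = split_indices(3190, 1500, 190)
print(X.shape, len(train_idx), len(val_idx), len(test_idx))
```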

Since our data takes the form of matrices, we first used a Convolutional Neural Network (CNN) as our model. Most approaches we found online use a single channel, which means they keep each matrix in its original 400 by 12 form. This is also the most straightforward idea: the original matrix is formed naturally, so neighboring numbers should carry meaningful structure. It was also the first time we had applied a CNN to a non-square matrix. Because our matrices are not very large, the network cannot be too complex. We borrowed from classic architectures and revised the design many times before settling on a final version. With this CNN, the test accuracy is consistently around 21%. After optimization it sometimes reaches 23%, which is still a poor result.

We then turned to another idea, the support vector machine (SVM), which Professor Gus Xia briefly mentioned in his lecture on PCA. Since a music clip is a sequence in time, its 400 by 12 feature matrix can also be flattened into a single vector of 4,800 values, so an SVM can be applied to it. We did this with sklearn, a widely used Python module. The SVM gave only 24% test accuracy. Suspecting that 4,800 dimensions might be too many for a vector, we used PCA (principal component analysis) to reduce it to 1,000 dimensions, but this gave an even lower test accuracy of 16%.
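The flatten, PCA, and SVM steps can be sketched with sklearn as below. This is an illustrative sketch on random data, not the project's actual script: the kernel, the scaling step, and the toy sizes are assumptions. In particular, the toy example uses 20 PCA components instead of 1,000, because the number of components cannot exceed the number of training samples.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 60 "songs" of shape 400 by 12 with 3 fake genres.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 400, 12))
y = rng.integers(0, 3, size=60)

# Flatten each 400 by 12 matrix into one 4,800-dimensional vector.
X_flat = X.reshape(len(X), -1)
X_train, y_train = X_flat[:40], y[:40]
X_test, y_test = X_flat[40:], y[40:]

# SVM on the raw flattened vectors (kernel choice is an assumption).
svm_only = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm_only.fit(X_train, y_train)

# Same model with PCA in front; 20 components only because the toy set is tiny.
svm_pca = make_pipeline(StandardScaler(), PCA(n_components=20), SVC(kernel="rbf"))
svm_pca.fit(X_train, y_train)
pred = svm_pca.predict(X_test)

print("SVM accuracy:      ", svm_only.score(X_test, y_test))
print("PCA + SVM accuracy:", svm_pca.score(X_test, y_test))
```

On random data both scores only hover around chance; the point is the shape of the pipeline, not the numbers.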

With neither model giving a satisfying result, we returned to an idea I had while working on the CNN: can we treat the 12 coefficients as channels? This means considering each MFCC feature individually, so that every song becomes a 20 by 20 by 12 tensor. We had initially abandoned this idea because the 20 by 20 layout is created artificially, so the patterns a convolutional network finds in it may not correspond to structure in the original data. We tried it anyway, and the result was a little surprising. The test accuracy fluctuates around 26% and reaches 27% at most, which is better on average than both the old CNN and the support vector machine.
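The difference between the two CNN inputs is only a matter of reshaping. The sketch below shows both layouts with NumPy. It assumes a channels-last layout and a row-by-row folding of the 400 segments into a 20 by 20 grid; the text does not specify either, and the function names are hypothetical.

```python
import numpy as np


def as_single_channel(x):
    """(N, 400, 12) -> (N, 400, 12, 1): the whole matrix as one image channel."""
    return x[..., np.newaxis]


def as_coefficient_channels(x, side=20):
    """(N, 400, 12) -> (N, 20, 20, 12): one channel per coefficient.

    Segment t lands at grid position (t // side, t % side); this row-by-row
    folding is an assumption, not something the text specifies.
    """
    n, t, c = x.shape
    if t != side * side:
        raise ValueError(f"expected {side * side} segments, got {t}")
    return x.reshape(n, side, side, c)


# Toy batch of 2 songs with distinct values so positions can be traced.
batch = np.arange(2 * 400 * 12, dtype=np.float32).reshape(2, 400, 12)
old_input = as_single_channel(batch)
new_input = as_coefficient_channels(batch)
print(old_input.shape, new_input.shape)
```

In the second layout, neighbors along a grid row are consecutive segments, while vertical neighbors are 20 segments apart, which is the artificial structure the paragraph above refers to.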

Although we tried different optimizers and regularization methods, neither the old CNN nor the SVM gave a good result. The new CNN does not perform well either, but it gives a higher test accuracy. We may therefore conclude that treating the MFCC features as separate channels is a better approach for a CNN.

We still have a lot to improve. First, 1,500 training samples are too few for our model. The Million Song Dataset provides a 10,000-song subset, but as mentioned, only about 3,000 of those songs meet our needs. If possible, we would collect more data and train for longer. Second, we hope to find a more diverse dataset. Rock songs make up a large part of ours, so Rock tends to be the model's most likely prediction. The model should perform better with a more balanced genre distribution. Third, it is still unclear how best to handle the structure of the MFCC features. Each song is a non-square matrix that also behaves like a vector, and our results do not tell us whether to treat it as a vector or as a matrix. There is still a long way to go in our further study of deep learning.

Beyond classifying single songs, further applications include creating playlists by genre, recommending songs with the same genre labels based on an individual's preferences, and visualizing the genre diversity of a singer.