/baa-doc-deep-embedded-music

Documentation for bachelor thesis "deep embedded music"

Primary LanguageTeX

Bachelor Thesis: Deep music embedding

For humans to distinguish between music genres is from easy to difficult, depending on how similar the given genres are. It is rather simple to distinguish music songs, for example, of classical music and techno or country and hip-hop rap. However, the difficulty increases if songs within the same genre are compared. When it comes to finding the similarity between songs or to sort a list of songs by their similarity, it is hard, even for humans. In this thesis, an alternative learning strategy is developed that exploits basic semantic properties of sound that are not grounded by an explicit label. This thesis intends to adapt Tile2Vec, an image embedding method, to audio streams and evaluate its performance on the "SINS, DCASE 2018: task 5" noise detection dataset as well as on a music dataset, consisting of songs of seven sub-genres. Tile2Vec is an adaption of triplet loss to an unsupervised setting using the distributional hypothesis from natural language, which states that words appearing in similar contexts tend to have similar meanings.

For this work, log Mel spectrograms were used to represent the audios in a more compact form. The selection of triplets is made by using the temporal proximity, and as the network, a state-of-the-art ResNet18 architecture is used. The embedding space is created by training a ResNet18 model utilising the triplet loss and using the log Mel spectrogram as input. This creates a lower-dimensional space, where the distances within, represent the similarities between the data points.

The first experiments with the noise detection dataset confirmed that the embedding space could learn semantic properties even without any label. The conducted error analysis revealed misclassified segments and microphones malfunctions in the dataset. Moreover, the embedding space showed clusters of similar-sounding sounds, regardless of its label. A simple linear logistic classifier trained on the resulting embeddings is compared to the results from the DCASE challenge. The best classifier of this thesis reached an F1 score of 62.19%, while most of the models submitted to the challenge accomplished results 80% and higher.

The same architecture is used to train an embedding space for music, which showed similar significant results. It revealed clear clusters when applying k-Means to it. Moreover, each cluster showed a significantly higher amount of segments of a specific class, which demonstrates that the model succeeded in building clusters for each genre. An expert affirmed, that the neighbours of each segment, even if from a different category, build a neighbourhood, which sounds similar and is therefore correct. When training the simple classifier on the music embedding space, it achieved an astonishing F1 score of approximately 84% on unseen test data. The expert, who performed the qualitative analysis, was thrilled and impressed with the results.

To conclude, the learnt embedding spaces from both datasets successfully learnt to find similarities regardless of its ground truth even if the categories are very similar to each other.