The project was completed in September 2019.
I wanted to find out whether a good music recommendation system can be built using only statistics of song lyrics, without NLP and without the audio itself. To do that, I decided to build a simple neural network with a few layers and find data to train it.
Unfortunately, I did not find a suitable dataset, so I collected my own; as of September 2019 it was the largest open dataset with a variety of metadata. I then open-sourced the dataset on Kaggle. All related work and analysis will be published there as well.
The dataset consists of:
songs_dataset.csv
contains 253k+ songs (different genres, decades and so on) with 10 features:
|Singer|Album|Song|Date|Featuring|Genre|Lyrics|Tags|Producers|Writers|
parts_dataset.csv
contains songs with lyrics split into parts (verse, chorus, hook, etc.).
Splitting lyrics into parts is not straightforward because the real-world data is dirty.
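To give a feel for what splitting lyrics into parts involves, here is a minimal sketch. It assumes that section headers appear on their own line in square brackets, such as `[Verse 1]` or `[Chorus: Someone]`; that header format, the `split_into_parts` helper and the `"unknown"` label are assumptions made for illustration, not the parser actually used to build parts_dataset.csv.

```python
import re
from typing import List, Tuple

# Assumption: a section header is a bracketed label alone on its line,
# e.g. "[Verse 1]" or "[Chorus: Someone]". Real data is dirtier than this.
HEADER_RE = re.compile(r"^[ \t]*\[([^\]\n]+)\][ \t]*$", re.MULTILINE)


def split_into_parts(lyrics: str) -> List[Tuple[str, str]]:
    """Return (part_label, part_text) pairs in order of appearance."""
    parts: List[Tuple[str, str]] = []
    label = "unknown"  # text before the first header has no known label
    cursor = 0
    for match in HEADER_RE.finditer(lyrics):
        text = lyrics[cursor:match.start()].strip()
        if text:
            parts.append((label, text))
        # Reduce "Verse 1" or "Chorus: Someone" to "verse" / "chorus".
        head = re.split(r"[\s:\d]", match.group(1).strip(), maxsplit=1)[0]
        label = head.lower() or "unknown"
        cursor = match.end()
    tail = lyrics[cursor:].strip()
    if tail:
        parts.append((label, tail))
    return parts


example = "Intro words\n[Verse 1]\nline a\nline b\n\n[Chorus: Someone]\nla la\n"
for part_label, part_text in split_into_parts(example):
    print(part_label, "->", repr(part_text))
```

Text that precedes any header, or lyrics with no headers at all, end up under `"unknown"` instead of being dropped, which is one simple way to cope with songs that do not follow the assumed format.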
Currently available notebooks:
- Data analysis and Plotly visualizations can be found here;
- Creation of the simple dense NN and further exploration of the resulting singer embeddings can be found here.
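As an illustration of what "statistics of song lyrics, without NLP" can mean in practice, here is a minimal sketch that turns the raw text of a song into a few numeric features using only whitespace splitting and counting. The `lyric_stats` helper and the specific features are hypothetical examples; the features actually used in the notebooks may differ.

```python
from collections import Counter
from typing import Dict


def lyric_stats(lyrics: str) -> Dict[str, float]:
    """Compute simple counting-based features from raw lyrics text."""
    lines = [ln.strip() for ln in lyrics.splitlines() if ln.strip()]
    words = lyrics.lower().split()  # plain whitespace split, no NLP tooling
    n_words = len(words)
    n_lines = len(lines)
    counts = Counter(words)
    return {
        "n_lines": float(n_lines),
        "n_words": float(n_words),
        # Share of distinct words: low values suggest repetitive lyrics.
        "unique_word_ratio": len(counts) / n_words if n_words else 0.0,
        "avg_word_len": sum(map(len, words)) / n_words if n_words else 0.0,
        "avg_words_per_line": n_words / n_lines if n_lines else 0.0,
        # Share of lines that duplicate an earlier line (e.g. a repeated hook).
        "repeated_line_ratio": 1 - len(set(lines)) / n_lines if n_lines else 0.0,
    }


stats = lyric_stats("la la la\nhello world\nla la la")
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

A vector like this, computed per song or aggregated per singer, is the sort of input a small dense network could be trained on.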
UPD: the dataset on Kaggle was deleted, so it is now available via Google Drive.
Thanks to Ilya Liyasov for helping develop the songs parser.