Now that you have scraped the Billboard website to create a hot_songs dataset, it's time to prepare a new dataset of not_hot_songs. This dataset can contain songs of your own choice, songs collected from the web, or any combination of the two. Some possible sources are:
- Wikipedia
- A subset of the Million Song Dataset (note: the full dataset takes several GB of disk space!)
- Kaggle
You want your not_hot_songs dataset to be:
- As heterogeneous as possible (in genre, length, etc.) so that the grouping step produces better clusters of songs.
- Not too big and not too small, typically around 2-3K songs (see the sampling sketch after this list).
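A minimal sketch of how you might build such a sample with pandas is shown below. The file name `songs_from_kaggle.csv` and the column names (`title`, `artist`, `genre`) are assumptions for illustration; adapt them to whichever source you actually use.

```python
# Sketch: sample ~2,500 songs from a larger CSV while keeping genre diversity.
# File and column names are assumptions -- adjust to your own source.
import pandas as pd

raw = pd.read_csv("songs_from_kaggle.csv")

# Keep only the columns we care about and drop exact duplicates.
raw = raw[["title", "artist", "genre"]].drop_duplicates()

# Sample roughly the same number of songs per genre so the final
# dataset stays heterogeneous, then cap the total at ~2,500 songs.
per_genre = 2500 // raw["genre"].nunique()
not_hot = (
    raw.groupby("genre", group_keys=False)
       .apply(lambda g: g.sample(min(len(g), per_genre), random_state=42))
       .reset_index(drop=True)
)

not_hot.to_csv("not_hot_songs.csv", index=False)
print(not_hot.shape)
```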
In a real-life scenario, you might want your dataset to be as big as possible and use specialized Big Data tools such as PySpark to group similar songs together. However, you are going to work on your own laptop, which has limited power. Therefore, you need to limit the size of your not_hot_songs dataset; otherwise, the process of grouping similar songs will take forever.
Your fork should contain a Jupyter notebook with the code to:
- Gather the songs
- Remove songs already present in the hot_songs dataset (see the sketch below)
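One way to handle the deduplication step is sketched below. It assumes both datasets are CSV files with `title` and `artist` columns; adjust the file and column names to match your own notebook.

```python
# Sketch: drop songs from not_hot_songs that already appear in hot_songs.
# Assumes both CSVs have "title" and "artist" columns -- an assumption, not a requirement.
import pandas as pd

hot = pd.read_csv("hot_songs.csv")
not_hot = pd.read_csv("not_hot_songs.csv")

def normalize(s: pd.Series) -> pd.Series:
    """Lowercase and strip whitespace so 'The Weeknd ' matches 'the weeknd'."""
    return s.str.lower().str.strip()

# Build a set of (title, artist) keys from the hot songs for fast lookup.
hot_keys = set(zip(normalize(hot["title"]), normalize(hot["artist"])))

# Keep only the not-hot songs whose key is absent from the hot set.
mask = [
    (title, artist) not in hot_keys
    for title, artist in zip(normalize(not_hot["title"]), normalize(not_hot["artist"]))
]

not_hot_clean = not_hot[mask].reset_index(drop=True)
not_hot_clean.to_csv("not_hot_songs.csv", index=False)
print(f"Removed {len(not_hot) - len(not_hot_clean)} overlapping songs")
```

Matching on a normalized (title, artist) pair rather than the raw strings helps catch near-duplicates that differ only in casing or stray whitespace.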