Exploratory data analysis and visualization of the “Million Songs Dataset” mainly using Python, Pandas and Matplotlib.
Notes:
- Dataset retrieved from: http://labrosa.ee.columbia.edu/millionsong/
- HDF5 getter functions provided/taken from: https://github.com/tbertinmahieux/MSongsDB/blob/master/PythonSrc/hdf5_getters.py
- All analysis is performed on the 10,000-song subset, not the full dataset
- Written in Python 2.7 due to inconsistencies between the Python 3 version of the tables module and the HDF5 getter functions provided by Thierry Bertin-Mahieux.
If you want to run the code:
>>> git clone https://github.com/eltonlaw/msd_data_exploration.git
Then download the dataset, unzip it, and move it into the repo:
>>> mv MillionSongSubset msd_data_exploration/MillionSongSubset
You also need the HDF5 helper functions from Thierry Bertin-Mahieux's GitHub repo:
>>> cd msd_data_exploration
>>> git clone https://github.com/tbertinmahieux/MSongsDB
Make a temp folder for the output graphs:
>>> mkdir temp
Your directory should look something like this now:
msd_data_exploration
│   grab_data.py
│   MillionSongSubset
│   model.py
│   README.md
│   run.py
│   scrape_categories.py
└───MSongsDB
│   │   ...
│   └───PythonSrc
│       │   hdf5_descriptors.py
│       │   hdf5_getters.py
│       │   ...
└───temp
Now you should be able to run the analysis with this command:
>>> python run.py
>>> python scrape_categories.py
Columbia shows an example datapoint on the dataset website. I wrote a simple web scraper using Beautiful Soup that prints the categories and their descriptions, to avoid the hassle of visiting the website each time; the output appears in the terminal.
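The scraping logic can be sketched roughly like this. The HTML below is invented for illustration; the real page layout on the Columbia site may differ:

```python
from bs4 import BeautifulSoup

# Invented stand-in for the field-description table on the dataset page.
HTML = """
<table>
  <tr><td>tempo</td><td>estimated tempo in BPM</td></tr>
  <tr><td>duration</td><td>length of the track in seconds</td></tr>
</table>
"""

def scrape_categories(html):
    """Return (category, description) pairs from a two-column table."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 2:
            pairs.append(tuple(cells))
    return pairs

for name, desc in scrape_categories(HTML):
    print("%s: %s" % (name, desc))
```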
basic_info(categories=["tempo","duration","key","time_signature","song_hotttnesss"])
Prints skew, distribution and pairwise correlation for the 5 following categories: tempo, duration, key, time_signature, song_hotttnesss.
The tested features appear to be linearly independent (their pairwise correlations are low).
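The skew and correlation computations can be sketched with pandas. The values below are toy data; the real values come out of the HDF5 files via hdf5_getters:

```python
import pandas as pd

# Toy frame standing in for the five extracted categories.
df = pd.DataFrame({
    "tempo":           [120.0, 92.5, 140.2, 101.1, 133.7],
    "duration":        [215.3, 187.0, 340.9, 256.4, 198.2],
    "key":             [0, 7, 2, 9, 4],
    "time_signature":  [4, 4, 3, 4, 4],
    "song_hotttnesss": [0.60, 0.21, 0.88, 0.45, 0.35],
})

skews = df.skew()   # per-column skewness
corr = df.corr()    # pairwise (Pearson) correlation matrix
print(skews)
print(corr)
```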
world_plot(lat="artist_latitude",lon="artist_longitude")
Plots the latitude and longitude of each artist.
Most of the datapoints come from North America and the EU, so the subset is not representative of the global population.
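Before scattering the points on a map, rows with unknown location have to be dropped; the MSD stores NaN for missing coordinates. A minimal sketch of that filtering step (the coordinates below are made up):

```python
import math

# Hypothetical (artist_latitude, artist_longitude) rows.
rows = [
    (40.71, -74.00),                # New York
    (51.51, -0.13),                 # London
    (float("nan"), float("nan")),   # unknown location
    (35.68, 139.69),                # Tokyo
]

coords = [(lat, lon) for lat, lon in rows
          if not (math.isnan(lat) or math.isnan(lon))]

# The surviving pairs would then go into a matplotlib scatter plot,
# e.g. plt.scatter([c[1] for c in coords], [c[0] for c in coords]).
print(coords)
```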
freq_plot(category="year")
Plots the normalized frequency of songs for each year, in ascending order of year.
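The normalization step amounts to counting songs per year and dividing by the total. A small sketch with made-up years (the MSD uses 0 when the year is unknown, so those entries are dropped):

```python
from collections import Counter

# Hypothetical "year" values; 0 marks a missing year.
years = [1998, 2001, 2001, 0, 2005, 2001, 1998, 0]

counts = Counter(y for y in years if y != 0)
total = sum(counts.values())
# Normalized frequency per year, keys in ascending order.
normalized = {y: counts[y] / float(total) for y in sorted(counts)}
print(normalized)
```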
stacked_bar_plot(full="duration",head_end="end_of_fade_in",tail_start="start_of_fade_out")
Plots the song duration in black and overlays the end of the fade-in and the start of the fade-out in red.
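The three fields relate as head/body/tail segments of one bar. A sketch of the arithmetic, with invented values in seconds:

```python
# Hypothetical per-song fields mirroring the three categories
# passed to stacked_bar_plot, all in seconds.
song = {"duration": 215.3, "end_of_fade_in": 2.4, "start_of_fade_out": 204.8}

fade_in_len = song["end_of_fade_in"]                         # red head segment
fade_out_len = song["duration"] - song["start_of_fade_out"]  # red tail segment
body_len = song["duration"] - fade_in_len - fade_out_len     # black middle

print(fade_in_len, body_len, fade_out_len)
```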
compare_to_average(x_cat="year",y_cat="artist_hotttnesss")
Plots the raw y data and the average y for each x, and highlights regions where the per-x average is above or below the overall average.
Each raw datapoint represents a song. The blue line along the bottom is the average "Artist Hotness" for each year that has raw datapoints; because some years in between are missing, the line is jagged. The dotted black line is the mean of these yearly averages. Green areas mark year ranges where the yearly average is above that mean; red areas mark where it is below.
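The above/below classification can be sketched with a pandas groupby on toy data (values invented for illustration):

```python
import pandas as pd

# Toy (year, artist_hotttnesss) pairs standing in for the raw datapoints.
df = pd.DataFrame({
    "year":              [1999, 1999, 2001, 2004, 2004, 2004],
    "artist_hotttnesss": [0.30, 0.50, 0.60, 0.20, 0.40, 0.30],
})

yearly_avg = df.groupby("year")["artist_hotttnesss"].mean()
overall = yearly_avg.mean()   # the dotted "average of averages" line

above = yearly_avg[yearly_avg > overall].index.tolist()  # green regions
below = yearly_avg[yearly_avg < overall].index.tolist()  # red regions
print(yearly_avg)
print(overall, above, below)
```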
error_bar(categories=["segments_loudness_max","segments_confidence"],data_start=[0,1],sec_i=[0,100])
Plots error bars for max loudness.
The full "segments_loudness_max" array contains 791 values for datapoint 0; this image shows the first 100 values and their associated confidence values.
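The windowing step reduces to slicing both per-segment arrays with the `sec_i` bounds. A sketch with fabricated segment data:

```python
# Hypothetical per-segment arrays for one song; the real ones have ~791 entries.
segments_loudness_max = [-5.1, -6.3, -4.8, -7.0, -5.5] * 40   # 200 values
segments_confidence   = [0.90, 0.75, 0.88, 0.60, 0.95] * 40

sec_i = (0, 100)  # window of segments to show, as in the error_bar call
window = segments_loudness_max[sec_i[0]:sec_i[1]]
conf_window = segments_confidence[sec_i[0]:sec_i[1]]
# matplotlib's plt.errorbar(range(len(window)), window, yerr=...) would
# then draw the bars for the windowed values.
print(len(window), len(conf_window))
```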
dr(x_categories=["key","loudness","mode","tempo","year"],y_category=["artist_mbtags","artist_mbtags_count"])
Plots dimensions reduced through T-SNE and PCA.
Result of going from 5 dimensions to 2 using the following categories: "key","loudness","mode","tempo","year". Used Principal Component Analysis and t-distributed Stochastic Neighbor Embedding.
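A minimal PCA sketch with NumPy, going from 5 dimensions to 2 on random stand-in data (the repo presumably uses library implementations such as scikit-learn's PCA and TSNE; this only illustrates the projection itself):

```python
import numpy as np

rng = np.random.RandomState(0)
# 50 fake songs x 5 features (key, loudness, mode, tempo, year).
X = rng.rand(50, 5)

# PCA via SVD: center the data, decompose, keep the top 2 components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T   # 50 x 2 embedding, ready for a scatter plot

print(X2.shape)
```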
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.
- Set up a .tar.gz unzipper
- Currently entire dataset needs to be loaded into memory prior to doing any analysis
- Write hdf5 helper functions