Exploratory data analysis and visualization of the “Million Songs Dataset” mainly using Python, Pandas and Matplotlib.
Notes:
- Dataset retrieved from: http://labrosa.ee.columbia.edu/millionsong/
- HDF5 getter functions provided/taken from: https://github.com/tbertinmahieux/MSongsDB/blob/master/PythonSrc/hdf5_getters.py
- All analysis is performed on the 10,000-song subset, not the full dataset
- Written in Python 2.7 due to inconsistencies between the Python 3 version of the tables module and the HDF5 getter functions provided by Thierry Bertin-Mahieux.
If you want to run the code:
>>> git clone https://github.com/eltonlaw/msd_data_exploration.git
Then download the dataset, unzip it, and move it into the repo:
>>> mv MillionSongSubset msd_data_exploration/MillionSongSubset
You also need the HDF5 helper functions from Thierry Bertin-Mahieux's GitHub repo:
>>> cd msd_data_exploration
>>> git clone https://github.com/tbertinmahieux/MSongsDB
Make a temp folder for the output graphs:
>>> mkdir temp
Your directory should look something like this now:
msd_data_exploration
│   grab_data.py
│   MillionSongSubset
│   model.py
│   README.md
│   run.py
│   scrape_categories.py
└───MSongsDB
│   │   ...
│   └───PythonSrc
│       │   hdf5_descriptors.py
│       │   hdf5_getters.py
│       │   ...
└───temp
Now you should be able to run the analysis with this command:
>>> python run.py
>>> python scrape_categories.py
Columbia shows an example datapoint on the dataset website. I wrote a simple web scraper using Beautiful Soup that prints the categories and their descriptions, to avoid the hassle of visiting the website each time; the output appears in the terminal.
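The scraping logic can be sketched roughly like this. The HTML below is invented for illustration; the real page layout on the Columbia site may differ:

```python
from bs4 import BeautifulSoup

# Invented stand-in for the field-description table on the dataset page.
HTML = """
<table>
  <tr><td>tempo</td><td>estimated tempo in BPM</td></tr>
  <tr><td>duration</td><td>length of the track in seconds</td></tr>
</table>
"""

def scrape_categories(html):
    """Return (category, description) pairs from a two-column table."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 2:
            pairs.append(tuple(cells))
    return pairs

for name, desc in scrape_categories(HTML):
    print("%s: %s" % (name, desc))
```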
basic_info(categories=["tempo","duration","key","time_signature","song_hotttnesss"])
Prints skew, distribution and pairwise correlation for the 5 following categories: tempo, duration, key, time_signature, song_hotttnesss.
The tested features appear to be linearly independent (their pairwise correlations are low).
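The skew and correlation computations can be sketched with pandas. The values below are toy data; the real values come out of the HDF5 files via hdf5_getters:

```python
import pandas as pd

# Toy frame standing in for the five extracted categories.
df = pd.DataFrame({
    "tempo":           [120.0, 92.5, 140.2, 101.1, 133.7],
    "duration":        [215.3, 187.0, 340.9, 256.4, 198.2],
    "key":             [0, 7, 2, 9, 4],
    "time_signature":  [4, 4, 3, 4, 4],
    "song_hotttnesss": [0.60, 0.21, 0.88, 0.45, 0.35],
})

skews = df.skew()   # per-column skewness
corr = df.corr()    # pairwise (Pearson) correlation matrix
print(skews)
print(corr)
```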
world_plot(lat="artist_latitude",lon="artist_longitude")
Plots the latitude and longitude of each artist.
Most of the datapoints come from North America and the EU, so the subset is not representative of the global population.
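Before scattering the points on a map, rows with unknown location have to be dropped; the MSD stores NaN for missing coordinates. A minimal sketch of that filtering step (the coordinates below are made up):

```python
import math

# Hypothetical (artist_latitude, artist_longitude) rows.
rows = [
    (40.71, -74.00),                # New York
    (51.51, -0.13),                 # London
    (float("nan"), float("nan")),   # unknown location
    (35.68, 139.69),                # Tokyo
]

coords = [(lat, lon) for lat, lon in rows
          if not (math.isnan(lat) or math.isnan(lon))]

# The surviving pairs would then go into a matplotlib scatter plot,
# e.g. plt.scatter([c[1] for c in coords], [c[0] for c in coords]).
print(coords)
```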
freq_plot(category="year")
Plots the normalized frequency of songs for each year, in ascending order of year.
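The normalization step amounts to counting songs per year and dividing by the total. A small sketch with made-up years (the MSD uses 0 when the year is unknown, so those entries are dropped):

```python
from collections import Counter

# Hypothetical "year" values; 0 marks a missing year.
years = [1998, 2001, 2001, 0, 2005, 2001, 1998, 0]

counts = Counter(y for y in years if y != 0)
total = sum(counts.values())
# Normalized frequency per year, keys in ascending order.
normalized = {y: counts[y] / float(total) for y in sorted(counts)}
print(normalized)
```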
stacked_bar_plot(full="duration",head_end="end_of_fade_in",tail_start="start_of_fade_out")
Plots the song duration in black and overlays the end of the fade-in and the start of the fade-out in red.
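The three fields relate as head/body/tail segments of one bar. A sketch of the arithmetic, with invented values in seconds:

```python
# Hypothetical per-song fields mirroring the three categories
# passed to stacked_bar_plot, all in seconds.
song = {"duration": 215.3, "end_of_fade_in": 2.4, "start_of_fade_out": 204.8}

fade_in_len = song["end_of_fade_in"]                         # red head segment
fade_out_len = song["duration"] - song["start_of_fade_out"]  # red tail segment
body_len = song["duration"] - fade_in_len - fade_out_len     # black middle

print(fade_in_len, body_len, fade_out_len)
```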
compare_to_average(x_cat="year",y_cat="artist_hotttnesss")
Plots the raw y data and the average y for each x, and highlights regions where the per-x average is above or below the overall average.
Each raw datapoint represents a song. The blue line along the bottom is the average "Artist Hotness" for each year that has raw datapoints; because some years in between are missing, the line is jagged. The dotted black line is the mean of these yearly averages. Green areas mark year ranges where the yearly average is above that mean; red areas mark where it is below.
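The above/below classification can be sketched with a pandas groupby on toy data (values invented for illustration):

```python
import pandas as pd

# Toy (year, artist_hotttnesss) pairs standing in for the raw datapoints.
df = pd.DataFrame({
    "year":              [1999, 1999, 2001, 2004, 2004, 2004],
    "artist_hotttnesss": [0.30, 0.50, 0.60, 0.20, 0.40, 0.30],
})

yearly_avg = df.groupby("year")["artist_hotttnesss"].mean()
overall = yearly_avg.mean()   # the dotted "average of averages" line

above = yearly_avg[yearly_avg > overall].index.tolist()  # green regions
below = yearly_avg[yearly_avg < overall].index.tolist()  # red regions
print(yearly_avg)
print(overall, above, below)
```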
error_bar(categories=["segments_loudness_max","segments_confidence"],data_start=[0,1],sec_i=[0,100])
Plots error bars for max loudness.
The full "segments_loudness_max" array contains 791 values for datapoint 0; this image shows the first 100 values and their associated confidence values.
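The windowing step reduces to slicing both per-segment arrays with the `sec_i` bounds. A sketch with fabricated segment data:

```python
# Hypothetical per-segment arrays for one song; the real ones have ~791 entries.
segments_loudness_max = [-5.1, -6.3, -4.8, -7.0, -5.5] * 40   # 200 values
segments_confidence   = [0.90, 0.75, 0.88, 0.60, 0.95] * 40

sec_i = (0, 100)  # window of segments to show, as in the error_bar call
window = segments_loudness_max[sec_i[0]:sec_i[1]]
conf_window = segments_confidence[sec_i[0]:sec_i[1]]
# matplotlib's plt.errorbar(range(len(window)), window, yerr=...) would
# then draw the bars for the windowed values.
print(len(window), len(conf_window))
```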
dr(x_categories=["key","loudness","mode","tempo","year"],y_category=["artist_mbtags","artist_mbtags_count"])
Plots dimensions reduced through T-SNE and PCA.
Result of going from 5 dimensions to 2 using the following categories: "key","loudness","mode","tempo","year". Used Principal Component Analysis and t-distributed Stochastic Neighbor Embedding.
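A minimal PCA sketch with NumPy, going from 5 dimensions to 2 on random stand-in data (the repo presumably uses library implementations such as scikit-learn's PCA and TSNE; this only illustrates the projection itself):

```python
import numpy as np

rng = np.random.RandomState(0)
# 50 fake songs x 5 features (key, loudness, mode, tempo, year).
X = rng.rand(50, 5)

# PCA via SVD: center the data, decompose, keep the top 2 components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T   # 50 x 2 embedding, ready for a scatter plot

print(X2.shape)
```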
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.
- Set up a .tar.gz unzipper
- Currently entire dataset needs to be loaded into memory prior to doing any analysis
- Write hdf5 helper functions