- Sam Wilson, samw11@cs
- Robert Thompson, robthomp@cs
You can access our visualization at http://homes.cs.washington.edu/~samw11/ (it may run slowly the first time) or download the repository and run it on a server with PHP installed. If you don't have your own server, the MAMP package we used during development will work.
This visualization lets users explore a very large dataset of songs along the channels we found most interesting: the year the songs were recorded, their artist, and their artist's rating.
The visualization is divided into three layers. The first layer is a pair of polar scatterplots of artists binned on year, familiarity rating, and hotness rating. Point size depends on the number of artists in that bin. Each bin is drawn twice, once on the upper half-circle graph and once on the lower. Year determines the radius on both, but the top graph maps hotness onto angle while the bottom graph maps familiarity onto angle. Mousing over a point shows a tooltip with that point's year, hotness, familiarity, and the number of artists in that bin. The corresponding point on the opposing half-circle graph is highlighted automatically.
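The mapping described above can be sketched in plain JavaScript. This is an illustration rather than the project's actual D3 code: the function name `polarPosition`, the field names, the option names, and the assumption that hotness and familiarity fall in the range 0 to 1 are all hypothetical.

```javascript
// Sketch only: maps one bin to its two positions, one per half-circle graph.
// Assumptions (not taken from the project code): hotness and familiarity lie
// in [0, 1], the chart is centered at (0, 0), and y grows downward as in SVG.
function polarPosition(bin, opts) {
  const { minYear, maxYear, maxRadius } = opts;

  // Year decides the radius on both half-circles.
  const r = ((bin.year - minYear) / (maxYear - minYear)) * maxRadius;

  // A value in [0, 1] sweeps half a circle (0 to pi radians).
  const topAngle = bin.hotness * Math.PI;
  const bottomAngle = bin.familiarity * Math.PI;

  return {
    // Negative y places the point in the upper half of the SVG.
    top: { x: r * Math.cos(topAngle), y: -r * Math.sin(topAngle) },
    // Positive y places the point in the lower half.
    bottom: { x: r * Math.cos(bottomAngle), y: r * Math.sin(bottomAngle) },
  };
}

// Hypothetical usage with made-up values.
const position = polarPosition(
  { year: 1990, hotness: 0.4, familiarity: 0.6 },
  { minYear: 1920, maxYear: 2010, maxRadius: 300 }
);
console.log(position.top, position.bottom);
```

Because both positions share the same radius, the two copies of a bin always sit at the same distance from the center, which is what makes highlighting the matching point on the opposite half easy to follow.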
Clicking a point transitions to the second layer, a bubble graph that shows more detail about the artists in that bin, including the number of songs they released that year and the average duration of those songs. Mousing over a bubble displays the full artist name, the number of songs they produced that year, and the average duration of those songs in seconds.
Clicking an artist bubble will display the full song information the database has on them, including all of their songs sorted by year then title and each song's duration. Clicking outside of the second layer bubble graph will transition back to the first visualization layer scatterplot.
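As a rough sketch of the data behind the second and third layers, the per-artist summary and the song ordering could be computed as follows. The record shape (`artist`, `title`, `year`, `duration`) and both function names are assumptions made for illustration, not the project's actual code.

```javascript
// Sketch only: the record fields and function names are hypothetical.

// Second layer: per-artist song count and average duration for one year.
function summarizeArtists(songs, year) {
  const byArtist = new Map();
  for (const song of songs) {
    if (song.year !== year) continue;
    const entry =
      byArtist.get(song.artist) ||
      { artist: song.artist, songCount: 0, totalDuration: 0 };
    entry.songCount += 1;
    entry.totalDuration += song.duration;
    byArtist.set(song.artist, entry);
  }
  return [...byArtist.values()].map((e) => ({
    artist: e.artist,
    songCount: e.songCount,
    avgDuration: e.totalDuration / e.songCount, // seconds
  }));
}

// Third layer: an artist's songs sorted by year, then by title.
function sortSongs(songs) {
  // Copy first so the caller's array is left untouched.
  return [...songs].sort(
    (a, b) => a.year - b.year || a.title.localeCompare(b.title)
  );
}
```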
The data domain we used for our visualization is the Million Song Dataset. The dataset contains collections of audio features and metadata for a million contemporary popular music tracks. Since the full dataset is very large, approximately 300GB, we eventually decided to use a subset with most of the metadata filtered out, keeping only the song title, album, artist name, song duration in seconds, artist familiarity, artist hotness, and the year of the song. The final subset contains approximately 500,000 songs, half of the original dataset.
Here are a few Tableau visualizations of the full dataset that influenced our design decisions.
A measure of the number of songs produced per year.
A bubble graph of all artists in the dataset. Bubble size is the number of songs while color shows artist hotness.
A scatterplot of song artist hotness vs year.
A scatterplot of song duration vs year.
Based on the dataset features presented above, we wanted to build a visualization that could accommodate the large number of records in the dataset and still allow users to explore the data fully. We decided early on to have several hierarchical visualization layers, each reducing the dataset further, until we could eventually display individual song titles. Our task was then to choose dimensions that had interesting trends on their own but could also reduce the data significantly when fixed. We found the increasing variety of song duration and artist ‘Hotness/Familiarity’ over time to be some of the more interesting trends and decided to highlight those features. Our sketches below show different attempts at visualizing duration, but the vast majority of songs fell in a narrow range (2-10 minutes), and songs within that range exhibited few interesting trends. Eventually we decided to use the hotness and familiarity measures instead and to show them in a scatterplot. We also liked the bubble graph of artists but felt it was far too dense when displaying the whole dataset. Thus, we decided to use a bubble graph as a secondary visualization layer after filtering the data by hotness, familiarity, and year. Once we had a usable bubble graph of artists, a third layer simply listing the artist’s songs was the obvious next step.
Apologies for the poor quality. These were done on physical paper in pencil and then scanned.
First, a selection of design sketches that did not end up being implemented.
A map showing artist location data (from another dataset we did not use).
Stacked area graph with data binned by song duration.
Text field to search by artist name.
Radial histogram binned by song duration.
Another version of the radial histogram, this time also showing a histogram of popularity within each duration bin.
Now the sketches that became the basis for the final product.
The radial scatterplot used as the first visualization layer.
An earlier sketch showing a Cartesian scatterplot along with the bubble graph second layer and text window third layer.
A sketch of the zooming animation that occurs when clicking a bubble and the corresponding animation in the text window.
Below is the first draft of the first layer visualization. The x axis is the year and the y axis is the hotness / familiarity value. As you can see, the points form a lot of clusters, and many of them overlap one another.
At this point we realized the dataset was still too large and had to devise methods to lessen the load on the browser. We decided to use PHP to parse the necessary information and only then send it to the D3 visualization code running in the browser. PHP proved too slow to parse the data, so we opted to bin the data on the dimensions of our first visualization layer (year, hotness, and familiarity) and to create a separate data file for each bin. This reduced the number of points we had to draw on the first layer scatterplot and also dramatically sped up the loading time for the second and third layers. To create a more consistent look, we also changed the scatterplot to polar coordinates to better match the bubble graph.
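The binning step could look roughly like the sketch below. The bin count of 10, the key format, the field names, and the assumption that hotness and familiarity lie between 0 and 1 are illustrative choices, not the values or scripts the project actually used.

```javascript
// Sketch only: bin count, key format, and field names are hypothetical.

// Convert a value assumed to lie in [0, 1] into a bin index in
// [0, binCount - 1]. Multiplying by binCount (rather than dividing by a bin
// width) avoids results such as 0.3 / 0.1 rounding down to bin 2.
function binIndex(value, binCount) {
  return Math.min(binCount - 1, Math.max(0, Math.floor(value * binCount)));
}

// Group songs by (year, hotness bin, familiarity bin).
function binSongs(songs, binCount = 10) {
  const bins = new Map();
  for (const song of songs) {
    const h = binIndex(song.hotness, binCount);
    const f = binIndex(song.familiarity, binCount);
    const key = `${song.year}_${h}_${f}`; // could double as a file name
    if (!bins.has(key)) {
      bins.set(key, {
        year: song.year,
        hotnessBin: h,
        familiarityBin: f,
        artists: new Set(), // distinct artists drive the point size
        songs: [],
      });
    }
    const bin = bins.get(key);
    bin.artists.add(song.artist);
    bin.songs.push(song);
  }
  return bins;
}
```

Under these assumptions, the first layer only needs one point per key with `artists.size` as its weight, while each bin's `songs` array is what would be written to that bin's own data file for the second and third layers.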
We also discovered some inconsistencies in the dataset. For example, we assumed that the artist familiarity value would be the same across all of an artist's entries, but it is not: the same artist can appear with different familiarity values. Similar inconsistencies occur for artist hotness. Because there are too many artists to check by hand, we were unable to correct the dataset and had to leave the errors in.
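One way to surface this kind of inconsistency is to collect the distinct values seen for each artist and report any artist with more than one. The sketch below is not something the project ran; the function name and field names are hypothetical.

```javascript
// Sketch only: lists artists whose entries disagree on a given field
// (for example "familiarity" or "hotness"). Field names are hypothetical.
function findInconsistentArtists(songs, field) {
  const valuesByArtist = new Map(); // artist -> Set of distinct values
  for (const song of songs) {
    if (!valuesByArtist.has(song.artist)) {
      valuesByArtist.set(song.artist, new Set());
    }
    valuesByArtist.get(song.artist).add(song[field]);
  }
  return [...valuesByArtist]
    .filter(([, values]) => values.size > 1)
    .map(([artist]) => artist);
}
```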
Below is the polar coordinates version of the first layer. Compared with the scatterplot version, we thought this visualization looked much better. We also started to work on the color at this point. After the color lecture, we both agreed not to use blue for the color scale. The final color schemes were chosen with the [Color Brewer 2](http://colorbrewer2.org/) website, and we decided to use red on the first layer and green on the second layer. We originally chose black as the background color for the first layer. It looked cool but did not work for the second layer. Since we needed consistency between layers, we ended up using white as the background color for everything.
The second and third layer design did not change much during development. Our primary concern for them was to reduce the dataset enough in the first layer so that each bubble was large enough to be seen. This meant we spent most of our time making sure the first layer did just that.
Finalization mostly involved merging the two visualization layers we had developed separately and creating a consistent look between them. Transitions had to suggest the first layer was ‘zooming in’ to the second layer and the opposite when going from the second to the first layer.
There are many changes from the initial sketches to the final visualization. There were a lot of issues with the data we did not anticipate, particularly size. As a result, we had to filter and bin the dataset in order to let our visualization run smoothly.
Regarding the first layer, we had planned to use a scatterplot from the start. After looking at the first draft of the first layer, we decided to change it to a variation of a bubble chart, since bubbles are the main theme of our visualization.
The full database and the first subset we attempted to use were in a format foreign to both of us: HDF5. We actually spent most of the first week just trying to get the data into a format D3 could read. Eventually we had to give up and go with a different subset that contained fewer data fields but was already in CSV format. Once we had the data, as mentioned above, the sheer size of it was a constant concern and obstacle. We didn’t anticipate having to make a PHP back-end at first, or having to bin the data into separate files. Handling all of this, including arranging the data ourselves and writing scripts to parse it, took a sizeable portion of our time.
Writing the actual visualization code was fairly straightforward and quick thanks to very helpful D3 examples. Styling the visualizations away from their default look took a long time, however, as did making transitions between the different layers.
Time spent:
- Converting dataset format: 20 hours
- Filtering and binning the dataset: 12 hours
- Dataset backend scripts: 6 hours
- Design layout: 6 hours
- Learning and implementation: 24 hours
- Styling and coloring: 5 hours
- Fixing bugs: 2 hours
- Refactoring: 1 hour
Sam:
- 1st visualization layer
- Filter and bin the dataset
- Transition between layer 1 and 2
Rob:
- 2nd and 3rd visualization layers.
- Backend PHP scripts for 2nd and 3rd layers.
- Also worked on 1->2 and 2->1 transitions
- D3
- D3-tip
- jQuery
- D3’s Zoomable Circle Packing Example
- D3’s Bar Chart with Tooltip Example
- D3’s Bubble Chart Example