Data_visualization

fordgobick_dataset

Diamonds Data Exploration

Dataset

The data consists of information regarding 54,000 round-cut diamonds, including price, carat, and other diamond qualities. The dataset can be found in the repository for R's ggplot2 library here, with feature documentation available here.

Summary of Findings

In the exploration, I found that there was a strong relationship between the price of a diamond and its carat weight, with modifying effects from the cut, color, and clarity grades given to the diamond. The relationship is approximately linear between price and carat when price is transformed to be on a logarithmic scale and carat transformed to be on a cube-root scale. I found a somewhat surprising result initially when the marginal trend for the cut, color, and clarity variables indicated that higher diamond quality was associated with lower price. However, higher diamond quality was also associated with smaller diamonds. When I isolated diamonds of a single carat weight, there was a clear positive relationship between higher diamond quality and higher diamond price.

Outside of the main variables of interest, I verified the relationship between diamond carat weight and its x, y, and z dimensions. For the dataset given, there was an interesting interaction in the categorical diamond quality features. The lower clarity grades looked like they had slightly better distribution of cut and color grades than diamonds with the higher clarity grades.

Key Insights for Presentation

For the presentation, I focus on just the influence of the four Cs of diamonds and leave out most of the intermediate derivations. I start by introducing the price variable, followed by the pattern in carat distribution, then plot the transformed scatterplot.

Afterwards, I introduce each of the categorical variables one by one. To start, I use the violin plots of price and carat across clarity. I'm only looking at the clarity grade plot here since it's the clearest example of how the categorical quality grades affect diamond pricing. The other two categorical variables, cut and color, are covered afterwards, using point plots. I've made sure to use different color palettes for each quality variable to make sure it is clear that they're different between plots.