
This program will analyze data about a large set of diamonds sold in the United States.


First we will load the data and show the basic shape and general statistics of the diamonds dataset. Next we will use filtering and sorting to view the five most expensive diamonds, then the five least expensive. We will then show information about the five largest diamonds in the dataset that have an ideal cut. After that, we will create lists that specify the order of the levels for each of the three categorical variables: cut, color, and clarity. Each list will be arranged from worst to best, which will allow more precise analysis later. Next we will count the number of diamonds at each level of each of the three categorical variables. We will then examine the relationship between the price and carat attributes with a scatter plot. We will recreate this plot with the points separated by clarity level, then split the clarity levels into their own charts.
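
The sorting, filtering, and counting steps above can be sketched in plain Python. This is a minimal illustration, not the program's actual code: the sample rows are made up, the helper names (`top_n`, `largest_with_cut`, `count_by_level`) are hypothetical, and the level orderings assume the levels used by the widely distributed diamonds dataset.

```python
from collections import Counter

# Assumed level orderings, worst to best (an assumption about the dataset).
CUT_LEVELS = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_LEVELS = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_LEVELS = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

# Made-up sample rows for illustration only; the real dataset is far larger.
diamonds = [
    {"carat": 0.3, "cut": "Ideal", "color": "E", "clarity": "SI2", "price": 500},
    {"carat": 0.5, "cut": "Premium", "color": "F", "clarity": "SI1", "price": 1200},
    {"carat": 2.0, "cut": "Premium", "color": "I", "clarity": "VS2", "price": 15000},
    {"carat": 1.5, "cut": "Ideal", "color": "G", "clarity": "IF", "price": 16000},
    {"carat": 0.9, "cut": "Good", "color": "J", "clarity": "VS1", "price": 2800},
    {"carat": 3.0, "cut": "Ideal", "color": "H", "clarity": "I1", "price": 12000},
    {"carat": 0.7, "cut": "Fair", "color": "D", "clarity": "VVS2", "price": 2900},
    {"carat": 1.0, "cut": "Very Good", "color": "G", "clarity": "VS2", "price": 5000},
]


def top_n(rows, key, n=5, descending=True):
    """Sort rows by one column and keep the first n."""
    return sorted(rows, key=lambda row: row[key], reverse=descending)[:n]


def largest_with_cut(rows, cut, n=5):
    """Filter rows to one cut level, then keep the n largest by carat."""
    matching = [row for row in rows if row["cut"] == cut]
    return top_n(matching, "carat", n)


def count_by_level(rows, column, levels):
    """Count rows per level, reported in the given worst-to-best order."""
    counts = Counter(row[column] for row in rows)
    return {level: counts.get(level, 0) for level in levels}


most_expensive = top_n(diamonds, "price")
least_expensive = top_n(diamonds, "price", descending=False)
largest_ideal = largest_with_cut(diamonds, "Ideal")
cut_counts = count_by_level(diamonds, "cut", CUT_LEVELS)
```

Passing the ordered level list to `count_by_level` keeps the counts in worst-to-best order and reports a zero for any level with no diamonds, which is the main reason to define the orderings before counting.
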
Next we will apply logarithmic transformations to the price and carat variables and explore the distributions of the transformed variables. We will compare a histogram of the original prices with a histogram of the log-transformed prices, then make the same comparison for carat size. Next we will use a scatter plot to visualize the transformed variables, and we will show the relationship between log-price and log-carat by diamond clarity. Lastly we will explore cut, color, and clarity by comparing the mean price and mean carat size for each level. For this comparison we will use bar charts, with a specific color for each level.
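
The transformation and per-level averaging described above can be sketched as follows. This is an illustration under stated assumptions rather than the program's own code: the rows are made up, the natural logarithm is assumed (the base is not stated), and the names `add_log_columns`, `mean_by_level`, `ln_price`, and `ln_carat` are hypothetical.

```python
import math
from statistics import mean

# Assumed cut ordering, worst to best (an assumption about the dataset).
CUT_LEVELS = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# Made-up sample rows for illustration only.
diamonds = [
    {"carat": 1.0, "cut": "Ideal", "price": 4000},
    {"carat": 2.0, "cut": "Ideal", "price": 6000},
    {"carat": 0.5, "cut": "Fair", "price": 1000},
]


def add_log_columns(rows):
    """Return copies of the rows with natural-log price and carat added."""
    return [
        {**row, "ln_price": math.log(row["price"]), "ln_carat": math.log(row["carat"])}
        for row in rows
    ]


def mean_by_level(rows, column, value, levels):
    """Average one numeric column per level, skipping levels with no rows."""
    result = {}
    for level in levels:
        values = [row[value] for row in rows if row[column] == level]
        if values:
            result[level] = mean(values)
    return result


logged = add_log_columns(diamonds)
mean_price_by_cut = mean_by_level(diamonds, "cut", "price", CUT_LEVELS)
mean_carat_by_cut = mean_by_level(diamonds, "cut", "carat", CUT_LEVELS)
```

The dictionaries returned by `mean_by_level` follow the worst-to-best level order, so they can be passed to a bar chart with one bar, and one color, per level.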