swirldev/swirl_courses

supplied vector `brk` does not reflect the bounds of each bin in the histogram


Course: Exploratory Data Analysis
Lesson: GGPlot2 Extras
Progress: 17%

Issue: I don't see how the vector counts matches the height of each bin in qplot(price, data = diamonds, binwidth = 18497/30).

> qplot(price, data = diamonds, binwidth = 18497/30)

| Your dedication is inspiring!

  |=======                                   |  17%
| No more messages in red, but a histogram almost
| identical to the previous one! If you typed
| 18497/30 at the command line you would get the
| result 616.5667. This means that the height of
| each bin tells you how many diamonds have a price
| between x and x+617 where x is the left edge of
| the bin.

...

  |========                                  |  19%
| We've created a vector containing integers that
| are multiples of 617 for you. It's called brk.
| Look at it now.

> brk
 [1]     0   617  1234  1851  2468  3085  3702  4319
 [9]  4936  5553  6170  6787  7404  8021  8638  9255
[17]  9872 10489 11106 11723 12340 12957 13574 14191
[25] 14808 15425 16042 16659 17276 17893 18510 19127

| You are amazing!

  |=========                                 |  20%
| We've also created a vector containing the number
| of diamonds with prices between each pair of
| adjacent entries of brk. For instance, the first
| count is the number of diamonds with prices
| between 0 and $617, and the second is the number
| of diamonds with prices between $617 and $1234.
| Look at the vector named counts now.

> counts
 [1]  4611 13255  5230  4262  3362  2567  2831  2841
 [9]  2203  1666  1445  1112   987   766   796   655
[17]   606   553   540   427   429   376   348   338
[25]   298   305   269   287   227   251    97

| Your dedication is inspiring!

  |=========                                 |  22%
| See how it matches the histogram you just
| plotted? So, qplot really works!

[Image: qplot-diamonds-price]

Some conflicting observations:

  • counts[2] (13255) is the largest value in counts; in the plot, however, the first bin is the tallest.
  • The plot shows bins 1 through 6 decreasing, bin 7 taller than bins 5 and 6, and bin 8 shorter than bin 7 but still taller than bin 6. In counts, ignoring element 1, the values decrease from element 2 through element 6; element 7 (2831) rises but exceeds only element 6 (2567), and element 8 (2841) rises again above element 7.

This vector does not match the histogram.
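For comparison, here is a minimal sketch of how a vector like counts could be computed from brk with base R. This is an assumption about how the lesson built it, not something the lesson states; the name counts_from_brk is made up for illustration, and brk is rebuilt here from the values printed above.

```r
library(ggplot2)  # used only for the diamonds dataset

# Breaks at multiples of 617, the same 32 values the lesson prints for brk
brk <- seq(0, 19127, by = 617)

# Count prices falling between each pair of adjacent breaks.
# hist() uses right-closed intervals (a, b] by default, which may differ
# from whatever convention the lesson used.
counts_from_brk <- hist(diamonds$price, breaks = brk, plot = FALSE)$counts
counts_from_brk
```

Whichever interval convention is used, these bins start at 0, so they can only line up with the plotted bars if the plot's first bin also starts at 0.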

I agree with the statement

...the height of each bin tells you how many diamonds have a price between x and x+617 where x is the left edge of the bin.

But the values in brk do not reflect this.

Substituting values into that statement, the first bin should run from 326 to 326+617, where 326 is the left edge of the first bin, not from 0 to 617 as brk indicates.
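The figures 326 and 18497 can be checked directly against the data. The short sketch below (the variable name price_range is just for illustration) shows where the lower edge and the binwidth numerator come from.

```r
library(ggplot2)  # used only for the diamonds dataset

# Smallest and largest price in the dataset
price_range <- range(diamonds$price)
price_range

# Span of the prices; this is the numerator in binwidth = 18497/30
diff(price_range)

# Width of each bin when that span is split into 30 equal bins
diff(price_range) / 30
```

The minimum price is 326 and the span is 18497, so a binning that starts at the smallest observation would begin at 326 rather than 0.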

Adding an offset of 326 to the values of brk gives the following values for counts2, which I feel better represent the plot.

> # Count prices in [326 + 617*(i-1), 326 + 617*i) for each of 30 bins
> counts2 <- numeric(30)
> for (i in seq_len(30)) {
+     counts2[i] <- nrow(diamonds[diamonds$price >= 326+617*(i-1) & diamonds$price < 326+617*i, ])
+ }
> counts2
 [1] 13308  6820  5214  3853  2933  2540  3021  2552  1818  1540  1264
[12]  1085   829   817   711   613   573   559   455   433   418   367
[23]   343   288   287   314   260   269   242   214
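Rather than inferring the bin edges from the picture, one option is to ask ggplot2 for the bins it actually computed. The sketch below assumes that ggplot() with geom_histogram() is equivalent to the qplot() call in the lesson, and it uses layer_data() to pull out the computed edges and counts; the names p and bins are illustrative.

```r
library(ggplot2)

# Same histogram as in the lesson, written with ggplot() instead of qplot()
p <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 18497 / 30)

# layer_data() returns the data frame the stat computed for a layer,
# including each bin's left edge (xmin), right edge (xmax), and count
bins <- layer_data(p, 1)
head(bins[, c("xmin", "xmax", "count")], 8)
```

Comparing the xmin column with brk, and the count column with counts and counts2, would show which set of edges the plot really uses. The exact edges may depend on the ggplot2 version and its default bin alignment.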

Regards,
Steve