Inconsistent behaviour of `geom_boxplot()` and `geom_violin()`
geom_boxplot() and geom_violin() behave inconsistently: the former needs only one of x or y, while the latter insists on being supplied both. Whether or not this was intentional, it is annoying that two such similar geoms do not work consistently.
library("ggplot2")
df <- data.frame(x = rnorm(100))
# this works
df |>
ggplot(
aes(x = x)
) +
geom_boxplot()
# this doesn't, errors complaining about missing `y`
df |>
ggplot(
aes(x = x)
) +
geom_violin()
# can make geom_violin work
df |>
ggplot(
aes(x = x, y = "")
) +
geom_violin()
# geom_density works without a second variable
df |>
ggplot(
aes(x = x)
) +
geom_density()
The error with the first geom_violin() example is:
Error in `geom_violin()`:
! Problem while computing stat.
ℹ Error occurred in the 1st layer.
Caused by error in `compute_layer()`:
! `stat_ydensity()` requires the following missing aesthetics: y.
Run `rlang::last_trace()` to see where the error occurred.
I expected that geom_violin() would work similarly to geom_boxplot() and not require both x and y to be specified.
This is with:
> packageVersion("ggplot2")
[1] ‘3.5.2.9002’
Full reprex:
library("ggplot2")
df <- data.frame(x = rnorm(100))
# this works
df |>
ggplot(
aes(x = x)
) +
geom_boxplot()
# this doesn't, errors complaining about missing `y`
df |>
ggplot(
aes(x = x)
) +
geom_violin()
#> Error in `geom_violin()`:
#> ! Problem while computing stat.
#> ℹ Error occurred in the 1st layer.
#> Caused by error in `compute_layer()`:
#> ! `stat_ydensity()` requires the following missing aesthetics: y.
# can make geom_violin work
df |>
ggplot(
aes(x = x, y = "")
) +
geom_violin()
# geom_density works without a second variable
df |>
ggplot(
aes(x = x)
) +
geom_density()
Created on 2025-09-10 with reprex v2.1.1
I don't know the historical reasons why this differs, but arguably geom_boxplot() is underspecified rather than geom_violin() being overspecified. It'd make sense to me to make geom_boxplot() also require x and y, but I'm afraid this will shipwreck on reverse dependencies.
I can see why making geom_boxplot() more like geom_violin() might not be desirable this late into ggplot2's history, but would modifying geom_violin() to work like geom_boxplot() in the one aesthetic case be that egregious in the name of consistency? (I can obviously work around this issue without problem, but I'm teaching this to students tomorrow and they will fixate on why these two work differently 😢 )
Personally, I'd argue that both are currently wrong:
geom_boxplot() shouldn't use a continuous y axis in this case, and
geom_violin() should create the required variable if only one aesthetic is supplied. Hacking the aes with y = "" is kludgy, and there is no y axis in this simple case, so inventing one in the data just to make it work seems wrong.
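Until something like this exists in ggplot2 itself, the y = "" workaround can at least be tucked into a small helper so that it doesn't clutter every plot. The sketch below assumes a hypothetical wrapper, geom_violin_single(), which is not part of ggplot2. It only supplies the dummy aesthetic at the layer level and drops the resulting axis title. It does not change how the geom or stat work.

```r
library(ggplot2)

# Hypothetical helper (not a ggplot2 function): wraps the `y = ""` workaround
# so a single violin can be drawn from one continuous variable mapped to x.
geom_violin_single <- function(...) {
  list(
    geom_violin(aes(y = ""), ...),  # dummy discrete position for the violin
    labs(y = NULL)                  # drop the axis title for the dummy variable
  )
}

set.seed(1)
df <- data.frame(x = rnorm(100))

p <- ggplot(df, aes(x = x)) + geom_violin_single()
p
```

The dummy axis is still there under the hood, so this hides the kludge rather than removing it.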
I wondered whether the after_stat() mechanism might be used here to give a created factor variable in the single aesthetic case, or the factor from the data/aesthetic if it is supplied?
OK after a bit of digging I found out the boxplot behaviour is due to a request from Hadley: #2110.
The same request for the same treatment of violin plots has not come in until today 🤷
I wondered whether the after_stat() mechanism might be used here to give a created factor variable in the single aesthetic case, or the factor from the data/aesthetic if it is supplied?
after_stat() with discrete values for position scales will not work: discrete positions aesthetics are transformed to continuous ones before the stat computation occurs. The stat computation thus takes in continuous values and is also expected to return continuous values for position aesthetics.
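One way to see this order of operations is to peek at what a stat actually receives. The sketch below defines a throwaway stat, StatPeek, and a recording environment, seen; both are hypothetical names made up for this illustration. A discrete variable is mapped to x, and the stat records whether the x values it is handed are already numeric.

```r
library(ggplot2)

# Environment used only to record what the stat sees (illustrative)
seen <- new.env()

# Hypothetical pass-through stat: records the type of `x`, returns data as-is
StatPeek <- ggproto(
  "StatPeek", Stat,
  required_aes = c("x", "y"),
  compute_group = function(data, scales) {
    seen$x_is_numeric <- is.numeric(data$x)
    data
  }
)

df <- data.frame(g = c("a", "b", "a", "b"), y = c(1, 2, 3, 4))

# `g` is discrete in the data, but is mapped to positions before the stat runs
p <- ggplot(df, aes(x = g, y = y)) + geom_point(stat = StatPeek)
built <- ggplot_build(p)

seen$x_is_numeric
```

If the description above holds, the recorded value is TRUE: the stat never sees the original labels "a" and "b", only their numeric positions.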
The same request for the same treatment of violin plots has not come in until today 🤷
Yeah, as one of the respondents in #2110 said, how often is one plotting a single boxplot [violin plot]? Still, I would argue there is a place for handling this case nicely, even if most data sets people encounter will likely have some grouping variable as part of the plot.
Having read #2110, there were some unresolved issues with the implementation of Hadley's request back then, the most notable one being that the fix made the axis numeric, which adds unnecessary plot furniture.
Thanks for explaining the relative order of operations re after_stat()/position scales.
Yeah, I agree that it is unnecessary plot furniture to have a break at x = 0, but as noted in #2292 (comment):
However, I couldn't find a clean way of changing it from a continuous scale from within the geom/stat ggprotos without hard coding the exceptions into the layer building process.
And I don't expect to find a clean way either, due to the order of operations of position transformation and stat computation.
I just had a student stumble upon this, as well as on how oddly geom_boxplot()/geom_violin() behave when x is a continuous variable (which seems like it should at least generate a warning?).
library(tidyverse)
penguins |>
ggplot(aes(x = body_mass, y= bill_len, fill = species)) +
geom_boxplot()
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`stat_boxplot()`).
penguins |>
ggplot(aes(x = body_mass, y= bill_len, fill = species)) +
geom_violin()
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_ydensity()`).
#> Warning: `position_dodge()` requires non-overlapping x intervals.
Created on 2025-10-04 with reprex v2.1.1
@rbcavanaugh This looks like a different problem to me and maybe should be filed as a separate issue? When x is a continuous variable it's unclear which boxplot or violin to draw and where, since it doesn't make sense (in general) to group on a continuous variable, and so the grouping is not aligned with the variable plotted along the x axis. I agree a warning or error might be better than just drawing something that is likely wrong or nonsensical.
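If binning the continuous variable is what was actually intended, the grouping has to be stated explicitly. Here is a minimal sketch using cut_width() from ggplot2, with made-up data standing in for the penguins example and an arbitrary bin width of 500. The fill = species mapping is left out to keep the grouping unambiguous.

```r
library(ggplot2)

set.seed(1)
# Made-up data standing in for the penguins example above
df <- data.frame(body_mass = runif(200, 2700, 6300))
df$bill_len <- 30 + df$body_mass / 300 + rnorm(200)

# Bin the continuous x variable explicitly so each box has a defined position
p <- ggplot(df, aes(x = body_mass, y = bill_len)) +
  geom_boxplot(aes(group = cut_width(body_mass, width = 500)))
p
```

With the group aesthetic set this way, each box summarises one bin of body_mass rather than whatever grouping ggplot2 would otherwise infer.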
I agree with OP. Both functions should have consistent behaviour. To avoid breaking reverse dependencies, perhaps a new function, such as geom_distribution(), could be introduced with a type argument (either "box" or "violin"). This unified function could then supersede the existing ones, ensuring consistent handling for both plot types.
If a single function for two geometries goes against the ggplot design philosophy, alternative functions like geom_boxplot2() and geom_violin2() could be introduced instead.
If having more than 1 function that does the same thing is also an issue, then updating the existing functions might be the best option. Consistent behaviour should take priority. Software evolves, and developers who rely on the current implementation can adapt during a well-managed deprecation period with appropriate warnings.
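For what it's worth, the interface of the proposed geom_distribution() could look roughly like the sketch below. This is purely hypothetical and not part of ggplot2. It only shows the type dispatch. It does not by itself make the single-aesthetic behaviour consistent, because each branch still calls the existing geom.

```r
library(ggplot2)

# Hypothetical dispatcher sketching the proposal; not a ggplot2 function.
geom_distribution <- function(type = c("box", "violin"), ...) {
  type <- match.arg(type)
  switch(type,
    box    = geom_boxplot(...),
    violin = geom_violin(...)
  )
}

set.seed(1)
df <- data.frame(g = rep(c("a", "b"), each = 50), y = rnorm(100))

p <- ggplot(df, aes(x = g, y = y)) + geom_distribution("violin")
p
```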




