text is misplaced with position_dodge()

In the example below, I would expect all of the text labels to be positioned perfectly on top of the data points. Instead, some of the text labels are not positioned correctly.

I think the issue is due to position_dodge(). I'm not sure exactly where to look to find the relevant code.

In the last example, I use ggrepel to help illustrate the problem more clearly. You can see the blue labels 34 and 290 are not pointing to the correct positions. It seems like they're pointing to the "undodged" positions instead of the "dodged" positions.

This issue was originally reported by @raviselker in ggrepel issues: slowkow/ggrepel#122

library(tidyverse)
library(ggrepel)
# remotes::install_github("thomasp85/patchwork")
library(patchwork)

set.seed(1337)

df <- tibble(
  x = rnorm(500),
  g1 = factor(sample(c("A", "B"), 500, replace = TRUE)),
  g2 = factor(sample(c("A", "B"), 500, replace = TRUE)),
  rownames = 1:500
)

is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

df_outliers <- df %>% group_by(g1, g2) %>% mutate(outlier = is_outlier(x))

p1 <- ggplot(df_outliers, aes(x = g1, y = x, fill = g2)) +
  geom_boxplot(width = 0.3, position = position_dodge(0.5))

p2 <- p1 +
  geom_text(
    data = . %>% filter(outlier),
    mapping = aes(label = rownames),
    position = position_dodge(0.5)
  )

p1 + p2

ggplot(df_outliers, aes(x = g1, y = x, fill = g2)) +
  geom_boxplot(width = 0.3, position = position_dodge(0.5)) +
  ggrepel::geom_label_repel(
    min.segment.length = 0,
    data = . %>% filter(outlier),
    mapping = aes(label = rownames),
    position = position_dodge(0.5)
  )

Created on 2018-12-02 by the reprex package (v0.2.1)

The underlying principle is that dodging doesn't work as one might expect when some data groupings don't exist.

library(ggplot2)
df <- data.frame(
  x = c("A", "A", "B"),
  type = c("a", "b", "a")
)

ggplot(df, aes(x, 1, color = type)) +
  geom_point(position = position_dodge(width = .5), size = 5)

Created on 2018-12-02 by the reprex package (v0.2.1)

I'm not sure this can be fixed with the current positioning approach, because the position adjustments never see the entire dataset. The question is whether we can come up with some delicate surgery that fixes this problem without completely changing how position adjustments work.

Maybe I spoke too soon. It appears that the various position functions do receive the entire dataset, at least the dataset per panel:

ggplot2/R/position-.r

Lines 16 to 34 in 5e4a6ef

    
    #'   - `compute_layer(self, data, params, panel)` is called once
    #'     per layer. `panel` is currently an internal data structure, so
    #'     this method should not be overridden.
    #'
    #'   - `compute_panel(self, data, params, panel)` is called once per
    #'     panel and should return a modified data frame.
    #'
    #'     `data` is a data frame containing the variables named according
    #'     to the aesthetics that they're mapped to. `scales` is a list
    #'     containing the `x` and `y` scales. There functions are called
    #'     before the facets are trained, so they are global scales, not local
    #'     to the individual panels. `params` contains the parameters returned by
    #'     `setup_params()`.
    #'   - `setup_params(data, params)`: called once for each layer.
    #'      Used to setup defaults that need to complete dataset, and to inform
    #'      the user of important choices. Should return list of parameters.
    #'   - `setup_data(data, params)`: called once for each layer,
    #'      after `setup_params()`. Should return modified `data`.
    #'      Default checks that required aesthetics are present.

So this should be fixable. The relevant code is here:

ggplot2/R/position-dodge.r

Lines 117 to 156 in 23a23cd

    
      compute_panel = function(data, params, scales) {
        collide(
          data,
          params$width,
          name = "position_dodge",
          strategy = pos_dodge,
          n = params$n,
          check.width = FALSE
        )
      }
    )

    # Dodge overlapping interval.
    # Assumes that each set has the same horizontal position.
    pos_dodge <- function(df, width, n = NULL) {
      if (is.null(n)) {
        n <- length(unique(df$group))
      }

      if (n == 1)
        return(df)

      if (!all(c("xmin", "xmax") %in% names(df))) {
        df$xmin <- df$x
        df$xmax <- df$x
      }

      d_width <- max(df$xmax - df$xmin)

      # Have a new group index from 1 to number of groups.
      # This might be needed if the group numbers in this set don't include all of 1:n
      groupidx <- match(df$group, sort(unique(df$group)))

      # Find the center for each group, then use that to calculate xmin and xmax
      df$x <- df$x + width * ((groupidx - 0.5) / n - .5)
      df$xmin <- df$x - d_width / n / 2
      df$xmax <- df$x + d_width / n / 2

      df
    }
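To see how this plays out numerically, here is a small base-R sketch that applies `pos_dodge()` (copied from the snippet above) to the three-point example from earlier. It assumes that `collide()` hands the strategy the rows for one x position at a time, and that the two types ended up with group ids 1 and 2. Both are assumptions made for illustration, not something verified against the source here.

```r
# pos_dodge() as quoted above (ggplot2 at 23a23cd).
pos_dodge <- function(df, width, n = NULL) {
  if (is.null(n)) {
    n <- length(unique(df$group))
  }
  if (n == 1)
    return(df)
  if (!all(c("xmin", "xmax") %in% names(df))) {
    df$xmin <- df$x
    df$xmax <- df$x
  }
  d_width <- max(df$xmax - df$xmin)
  groupidx <- match(df$group, sort(unique(df$group)))
  df$x <- df$x + width * ((groupidx - 0.5) / n - .5)
  df$xmin <- df$x - d_width / n / 2
  df$xmax <- df$x + d_width / n / 2
  df
}

# Toy data: x = "A" mapped to 1, x = "B" mapped to 2 (assumed mapping).
# Group ids 1 and 2 stand for type "a" and "b" (assumed).
d <- data.frame(
  x     = c(1, 1, 2),
  xmin  = c(1, 1, 2),
  xmax  = c(1, 1, 2),
  group = c(1, 2, 1)
)

# Assumption: the strategy runs separately on the rows at each x position.
dodged <- do.call(rbind, lapply(split(d, d$x), pos_dodge, width = 0.5))
dodged$x
```

At x = 1 both groups are present, so the points move to 0.875 and 1.125. At x = 2 only one group is present, `n` is 1, and the point stays at 2 instead of moving to 1.875.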

> It appears that the various position functions do receive the entire dataset, at least the dataset per panel

I'm afraid not. Position$compute_panel() is called from Position$compute_layer(), and Position$compute_layer() is called from Layer$compute_position(), which is called per layer with each layer's data. So it doesn't know about the other layers' data.

ggplot2/R/plot-build.r

Line 77 in 23a23cd

data <- by_layer(function(l, d) l$compute_position(d, layout))
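A rough base-R illustration of what "per layer" means for the original problem. `dodge_x()` is a hypothetical stand-in for the dodging arithmetic, not a ggplot2 function; the point is only that `n` is derived from whatever data the layer itself carries.

```r
# Hypothetical stand-in for a per-layer dodge: n comes from the groups
# present in the data passed in, which is all a single layer can see.
dodge_x <- function(d, width) {
  n <- length(unique(d$group))
  if (n == 1) return(d$x)
  idx <- match(d$group, sort(unique(d$group)))
  d$x + width * ((idx - 0.5) / n - 0.5)
}

# Two layers at the same x position: the label layer has been filtered
# down to a single row, so it contains only one group.
layers <- list(
  points = data.frame(x = c(1, 1), group = c(1, 2)),
  labels = data.frame(x = 1, group = 2)
)

# Each layer is positioned independently, like by_layer() above.
positions <- lapply(layers, dodge_x, width = 0.5)
positions
```

The points land at 0.875 and 1.125, while the lone label stays at 1, which matches the "undodged" label positions described at the top of the thread.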

BTW, I feel this description is not quite right. Maybe, "once per panel per layer"?

ggplot2/R/position-.r

Lines 20 to 21 in 5e4a6ef

    
    #'   - `compute_panel(self, data, params, panel)` is called once per
    #'     panel and should return a modified data frame.

But that should still be good enough to get the dodging right within each layer and panel. I think the other problem is that we're not using an explicit dodging aesthetic. position_dodge() simply finds all distinct groups at each x position and spreads them out. If we gave it an explicit aesthetic, e.g. aes(dodge = type), or maybe as an optional argument to position_dodge(), e.g. position_dodge(dodge_by = type), then the position adjustment could make smarter decisions about where to place which data points.
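As a sketch of what an explicit dodging variable could buy us: the helper below is hypothetical (neither `dodge_by_levels()` nor a `dodge_by` argument exists in ggplot2), and it only shows the arithmetic. The offset is computed from the full set of levels of the dodging variable, so it no longer depends on which groups happen to be present at a given x.

```r
# Hypothetical helper, not part of ggplot2: offsets depend on all levels of
# the dodging variable rather than on the groups present at each x.
dodge_by_levels <- function(x, by, width) {
  by <- factor(by)
  n <- nlevels(by)          # total number of levels, fixed for the whole layer
  idx <- as.integer(by)     # position of each row's level within 1:n
  x + width * ((idx - 0.5) / n - 0.5)
}

# Same toy data as before: "A" -> 1, "B" -> 2 (assumed mapping).
x <- c(1, 1, 2)
type <- c("a", "b", "a")
new_x <- dodge_by_levels(x, type, width = 0.5)
new_x
```

Here the lone "a" at x = 2 is moved to 1.875, the same side it takes at x = 1. A label layer would still need to know the full set of levels (for example via factor levels that survive filtering) for this to line up across layers.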

Here is another example, building on Claus' code.

It seems that color and fill are not treated the same way by ggplot2. I found this surprising and unexpected -- perhaps this is intended behavior?

library(ggplot2)
df <- data.frame(
  x = c("A", "A", "B"),
  type = c("a", "b", "a")
)

pos <- position_dodge(width = 0.5)

p <- ggplot(df) +
  geom_point(position = pos, shape = 21, size = 10, stroke = 1) +
  geom_text(aes(label = type), color = "black", position = pos)

p + aes(x, 1, color = type)

p + aes(x, 1, color = type, group = type)

p + aes(x, 1, fill = type)

Created on 2018-12-02 by the reprex package (v0.2.1)

Session info

devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  os       macOS High Sierra 10.13.6   
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2018-12-02                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source
#>  assertthat    0.2.0      2017-04-11 [1] CRAN (R 3.5.0)
#>  backports     1.1.2      2017-12-13 [1] CRAN (R 3.5.0)
#>  base64enc     0.1-3      2015-07-28 [1] CRAN (R 3.5.0)
#>  bindr         0.1.1      2018-03-13 [1] CRAN (R 3.5.0)
#>  bindrcpp      0.2.2      2018-03-29 [1] CRAN (R 3.5.0)
#>  callr         3.0.0      2018-08-24 [1] CRAN (R 3.5.0)
#>  cli           1.0.1      2018-09-25 [1] CRAN (R 3.5.0)
#>  colorspace    1.3-2      2016-12-14 [1] CRAN (R 3.5.0)
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.0)
#>  curl          3.2        2018-03-28 [1] CRAN (R 3.5.0)
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.0)
#>  devtools      2.0.1      2018-10-26 [1] CRAN (R 3.5.1)
#>  digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.0)
#>  dplyr         0.7.8      2018-11-10 [1] CRAN (R 3.5.0)
#>  evaluate      0.12       2018-10-09 [1] CRAN (R 3.5.0)
#>  fs            1.2.6      2018-08-23 [1] CRAN (R 3.5.0)
#>  ggplot2     * 3.1.0.9000 2018-12-02 [1] Github (tidyverse/ggplot2@23a23cd)
#>  glue          1.3.0      2018-07-17 [1] CRAN (R 3.5.0)
#>  gtable        0.2.0      2016-02-26 [1] CRAN (R 3.5.0)
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.0)
#>  httr          1.3.1      2017-08-20 [1] CRAN (R 3.5.0)
#>  knitr         1.20       2018-02-20 [1] CRAN (R 3.5.0)
#>  labeling      0.3        2014-08-23 [1] CRAN (R 3.5.0)
#>  lazyeval      0.2.1      2017-10-29 [1] CRAN (R 3.5.0)
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.0)
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.0)
#>  mime          0.6        2018-10-05 [1] CRAN (R 3.5.0)
#>  munsell       0.5.0      2018-06-12 [1] CRAN (R 3.5.0)
#>  pillar        1.3.0      2018-07-14 [1] CRAN (R 3.5.0)
#>  pkgbuild      1.0.2      2018-10-16 [1] CRAN (R 3.5.0)
#>  pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.0)
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.0)
#>  plyr          1.8.4      2016-06-08 [1] CRAN (R 3.5.0)
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.0)
#>  processx      3.2.0      2018-08-16 [1] CRAN (R 3.5.0)
#>  ps            1.2.1      2018-11-06 [1] CRAN (R 3.5.0)
#>  purrr         0.2.5      2018-05-29 [1] CRAN (R 3.5.0)
#>  R6            2.3.0      2018-10-04 [1] CRAN (R 3.5.0)
#>  Rcpp          1.0.0      2018-11-07 [1] CRAN (R 3.5.0)
#>  remotes       2.0.2      2018-10-30 [1] CRAN (R 3.5.0)
#>  rlang         0.3.0.1    2018-10-25 [1] CRAN (R 3.5.0)
#>  rmarkdown     1.10       2018-06-11 [1] CRAN (R 3.5.0)
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.0)
#>  scales        1.0.0      2018-08-09 [1] CRAN (R 3.5.0)
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.0)
#>  stringi       1.2.4      2018-07-20 [1] CRAN (R 3.5.0)
#>  stringr       1.3.1      2018-05-10 [1] CRAN (R 3.5.0)
#>  testthat      2.0.1      2018-10-13 [1] CRAN (R 3.5.0)
#>  tibble        1.4.2      2018-01-22 [1] CRAN (R 3.5.0)
#>  tidyselect    0.2.5      2018-10-11 [1] CRAN (R 3.5.0)
#>  usethis       1.4.0      2018-08-14 [1] CRAN (R 3.5.0)
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.0)
#>  xml2          1.2.0      2018-01-24 [1] CRAN (R 3.5.0)
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

@slowkow What you're seeing is color = "black" shadowing the color aesthetic in the text layer. Apparently the label aesthetic is not considered when groups are calculated.

library(ggplot2)
df <- data.frame(
  x = c("A", "A", "B"),
  type = c("a", "b", "a")
)

pos <- position_dodge(width = 0.5)

p <- ggplot(df) +
  geom_point(position = pos, shape = 21, size = 10, stroke = 1) +
  geom_text(aes(label = type), position = pos)

p + aes(x, 1, color = type)

Created on 2018-12-02 by the reprex package (v0.2.1)

Yes, labels are not considered when calculating grouping, and that is done by design. (Presumably because it's not uncommon for labels to be all different even within a group.)

ggplot2/R/grouping.r

Lines 7 to 10 in 1c09bae

    
    # If the `group` variable is not present, then a new group
    # variable is generated from the interaction of all discrete (factor or
    # character) vectors, excluding `label`. The special value `NO_GROUP`
    # is used for all observations if no discrete variables exist.
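The rule in that comment can be mimicked in a few lines of base R. This is a simplified sketch, not ggplot2's actual implementation: `sketch_group()` is a hypothetical name, `-1L` is just a stand-in for `NO_GROUP`, and the group numbering may differ from what ggplot2 produces. It shows why a text layer with a constant `color = "black"` ends up with a single group at each x.

```r
# Simplified sketch of the grouping rule described above; hypothetical helper.
sketch_group <- function(df) {
  discrete <- vapply(
    df,
    function(col) is.factor(col) || is.character(col),
    logical(1)
  )
  discrete[names(df) == "label"] <- FALSE   # labels never define groups
  if (!any(discrete)) {
    return(rep(-1L, nrow(df)))              # stand-in for NO_GROUP
  }
  as.integer(interaction(df[discrete], drop = TRUE))
}

# Text layer with a constant colour: only x (and the ignored label) is discrete.
text_layer <- data.frame(
  x = c("A", "A", "B"),
  label = c("a", "b", "a"),
  stringsAsFactors = FALSE
)
g_text <- sketch_group(text_layer)

# Point layer where colour is mapped to type: x and colour both contribute.
point_layer <- data.frame(
  x = c("A", "A", "B"),
  colour = c("a", "b", "a"),
  stringsAsFactors = FALSE
)
g_point <- sketch_group(point_layer)

g_text
length(unique(g_point))
```

In the text layer both rows at x = "A" share one group, so there is nothing to dodge, while the point layer has a separate group for each row.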

> to get the dodging right within each layer and panel.

Sorry, I don't get the point yet... We are talking about the inconsistency of the positions between layers, not within each layer, right?

Letting positions have aesthetics sounds cool to me, which you've also indicated in #2977 (comment).

I am talking within each layer. I think there should be an option that guarantees that dodging always looks the same across all x values. In the example here, we would want type = "a" always be dodged to the left and type = "b" always be dodged to the right, regardless of whether the other type is present at a given x or not. As a side effect, this would fix the original problem.

On a related note, see this closed PR that wasn't merged, and the issue of violins moving in the wrong spot under preserve = "single": #2813

It's the same problem. The dodging doesn't know about the variable that it is dodging by, and therefore it does strange things.

Thanks, I got what you mean. It's still unclear to me how to map groups to dodged positions without training over all layers, but I think I'll find it later :)

In case this is still useful, here's another version of the reprex, which I believe is minimal for this issue:

library(ggplot2)

d <- data.frame(x = c("x", "x"), g = c("a", "b"), stringsAsFactors = FALSE)
pos <- position_dodge(width = .5)

ggplot(mapping = aes(x, 0, colour = g, label = g)) +
  geom_point(data = d, size = 5, position = pos) +
  geom_label(data = d[2, ], size = 5, position = pos)

Created on 2018-12-03 by the reprex package (v0.2.1)
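One possible workaround for the minimal case, not proposed in this thread and offered only as a sketch: give the label layer the same rows as the point layer and blank out the labels that should not be drawn, so that both groups are present when the label layer is dodged. It uses `geom_text()` rather than `geom_label()`, since an empty label would still draw an empty box. The `lab` column and the `vjust` value are illustrative choices.

```r
library(ggplot2)

d <- data.frame(x = c("x", "x"), g = c("a", "b"), stringsAsFactors = FALSE)
pos <- position_dodge(width = .5)

# Keep both rows in the text layer so both groups exist at this x position;
# only the label text for "a" is blanked.
d_lab <- d
d_lab$lab <- ifelse(d$g == "b", d$g, "")

p <- ggplot(mapping = aes(x, 0, colour = g)) +
  geom_point(data = d, size = 5, position = pos) +
  geom_text(data = d_lab, aes(label = lab), size = 5, position = pos, vjust = -1.5)

# Compare the dodged x positions computed for the two layers.
x_points <- as.numeric(layer_data(p, 1)$x)
x_labels <- as.numeric(layer_data(p, 2)$x)
x_points
x_labels
```

Because both layers now see the same groups, their dodged x positions should agree. This does not address the underlying design question discussed above; it only sidesteps it for layers that can share the same rows.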

> I think there should be an option that guarantees that dodging always looks the same across all x values. In the example here, we would want type = "a" always be dodged to the left and type = "b" always be dodged to the right, regardless of whether the other type is present at a given x or not. As a side effect, this would fix the original problem.

This has been requested before in #2076 and I agree that it would be a nice feature to have, though if I remember correctly it would require some significant refactoring. We'd also have to think through how geoms with different widths across groups would get placed (e.g. box plots with varwidth = TRUE). For this reason I don't know that fixing this would solve the original problem unless the position calculation knew about other layers. One of the things that's tricky about dodging points and labels in particular is that they have no width in the data space, so the position calculations that calculate where things go based on width don't work right.

Is this the same issue as #2480?

yes I think so
