sa-lee/easel

A grammar of aesthetics


At the moment we define aesthetic mappings to variables with the visualise function: we state explicitly which aesthetic elements map to which variable. This eventually results in a call to dplyr::mutate that augments the data with "aes_" columns. As the grammar of graphics is currently set up, a user is generally required to create a long form data frame via some data manipulation (if the data are in wide form). In ggplot2 there is a one-to-one mapping between aesthetics and variables, but it doesn't necessarily have to be that way.

Could we use scoped variants of these functions to imply these operations are being done on certain collections of variables? Can we map multiple variables to an aesthetic?

Let's consider two examples: a side-by-side boxplot and a parallel coordinates plot.

Here's a fairly common matrix structure in genomics along with the ggplot specs for a boxplot:

library(tidyverse)
set.seed(100)
tbl <- tibble::tibble(gene_id = 1:30L, 
                      A1 = rnorm(30), 
                      A2 = rnorm(30), 
                      B1 = rnorm(30, mean =  0.5), 
                      B2 = rnorm(30, mean = 0.5, sd = 3))

tbl_by_expr <- tbl %>%
  gather("sample", "expression", -gene_id) 
# boxplot
tbl_by_expr %>%
  ggplot(aes(x = sample, y = expression)) +
  geom_boxplot()

The boxplot requires a gather call to go from wide to long, followed by computing summary statistics on each slice of the long form. This computation becomes inefficient as the number of variables grows. I would also argue that people find the wide form intuitive. An alternative would be to keep the wide form around and perform operations column-wise. Here we introduce the notion of visualise_at, which allows the slice operator to place multiple variables on one aesthetic. In our API this could look something like:

tbl %>%
    visualise_at(x = A1:B2)

That is, we specify that all variables from A1 up to B2 are placed on the x-axis (a one-to-many map). We could either change the table implicitly here via a gather call or defer the gathering until the end of the pipeline. If we could have data.frame columns in a tibble, nesting the variables in an aes_x column could provide some computational gains. A boxplot is an interesting geom too, since it is a compound of points, rectangles and lines (perhaps it is best to leave it as draw_boxplot).
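As a rough sketch of the nested aes_x idea, tidyr::pack can already store several columns as a single data frame column. This is not easel's implementation; the names `mapped`, `medians` and `long` are made up for illustration, and pack is only a stand-in for whatever visualise_at would do internally.

```r
library(tidyr)

set.seed(100)
tbl <- tibble::tibble(gene_id = 1:30,
                      A1 = rnorm(30),
                      A2 = rnorm(30),
                      B1 = rnorm(30, mean = 0.5),
                      B2 = rnorm(30, mean = 0.5, sd = 3))

# Hypothetical stand-in for `visualise_at(x = A1:B2)`: nest the selected
# columns into one data frame column called `aes_x`.
mapped <- pack(tbl, aes_x = A1:B2)

# `aes_x` is itself a data frame that still holds A1, A2, B1 and B2.
names(mapped$aes_x)

# Column-wise operations can run on the nested columns without reshaping.
medians <- vapply(mapped$aes_x, median, numeric(1))

# Gathering is deferred until a long form is actually needed.
long <- gather(unpack(mapped, aes_x), "sample", "expression", -gene_id)
```

Whether this gives real computational gains would need benchmarking; the sketch only shows that the wide form can travel through a pipeline as one aesthetic column.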

tbl %>%
    visualise_at(x = A1:B2) %>%
    # computed without gathering first, using `dplyr::summarise_at`, then `tidyr::gather`
    summarise_box() %>%
    draw_boxplot()
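For comparison, here is a sketch of what summarise_box and draw_boxplot might reduce to using only existing dplyr, tidyr and ggplot2 functions. It makes two assumptions that are not part of the proposal: the whiskers are simply the minimum and maximum (geom_boxplot's default uses 1.5 times the IQR), and tidyr::pivot_longer is swapped in for gather because it can split the summary column names in one step.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(100)
tbl <- tibble::tibble(gene_id = 1:30,
                      A1 = rnorm(30),
                      A2 = rnorm(30),
                      B1 = rnorm(30, mean = 0.5),
                      B2 = rnorm(30, mean = 0.5, sd = 3))

# Summarise each column on the wide form; the result is a single row
# with columns named like A1_ymin, A1_lower, ...
box_stats <- tbl %>%
  summarise_at(
    vars(A1:B2),
    list(ymin   = min,
         lower  = ~ unname(quantile(.x, 0.25)),
         middle = median,
         upper  = ~ unname(quantile(.x, 0.75)),
         ymax   = max)
  ) %>%
  # Reshape only the small summary table: one row per sample,
  # one column per statistic.
  pivot_longer(everything(),
               names_to = c("sample", ".value"),
               names_sep = "_")

# Draw from the precomputed statistics rather than the raw data.
p <- ggplot(box_stats,
            aes(x = sample, ymin = ymin, lower = lower, middle = middle,
                upper = upper, ymax = ymax)) +
  geom_boxplot(stat = "identity")
```

Only the 4-row summary table is reshaped here, rather than the full 120-row long form.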

Another example is the parallel coordinates plot (PCP). Again, this is a plot where multiple variables are mapped to a single aesthetic. A scaling operation is also required for this plot so that the variables can be compared with each other.

Here's one possible way of making a PCP with ggplot2:

tbl_by_expr %>%
  group_by(sample) %>%
  mutate(scaled_expression = (expression - mean(expression)) / sd(expression)) %>%
  ggplot(aes(x = sample, y = scaled_expression, group = gene_id)) +
  geom_line()

Again, with our API one possibility is to use compound aesthetics. The open question is how to represent the scaling options: essentially this is a mutate on each variable (should it be done before or after a visualise call?). Another option is to call visualise_at with an optional function that modifies those aesthetics (i.e. pass it down to dplyr::mutate_at, then gather):

tbl %>%
     visualise_at(x = A1:B2, .f = scale) %>%
     draw_lines()
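A minimal sketch of the "mutate_at, then gather" route with existing functions might look like the following. The helper `scale_vec` is hypothetical. It is there because base scale() returns a one-column matrix, so passing scale directly would leave matrix columns in the tibble.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(100)
tbl <- tibble::tibble(gene_id = 1:30,
                      A1 = rnorm(30),
                      A2 = rnorm(30),
                      B1 = rnorm(30, mean = 0.5),
                      B2 = rnorm(30, mean = 0.5, sd = 3))

# Hypothetical helper: centre and scale a vector, dropping the matrix
# structure that `scale()` adds.
scale_vec <- function(x) as.numeric(scale(x))

pcp_data <- tbl %>%
  # Scale each variable on the wide form (the role `.f` would play) ...
  mutate_at(vars(A1:B2), scale_vec) %>%
  # ... and only then gather to long form for plotting.
  gather("sample", "scaled_expression", -gene_id)

# One line per gene across the four samples.
p <- ggplot(pcp_data,
            aes(x = sample, y = scaled_expression, group = gene_id)) +
  geom_line()
```

This should give the same scaled values as the group_by(sample) version above, without grouping the long form first.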