matloff/TidyverseSkeptic

simplistic example of base R v. dplyr

ljanda opened this issue · 23 comments

You write:

The Tidyverse also makes heavy use of magrittr pipes, e.g. writing the function composition h(g(f(x))) as

f(x) %>%  g() %>% h()

Again, the pitch made is that this is "English," in this case reading left-to-right. But again, one might question just how valuable that is, and in any event, I personally tend to write such code left-to-right anyway, without using pipes:

a <- f(x)
b <- g(a)
h(b)

This simplistic example does not demonstrate the pain point of repeatedly stopping to assign rather than piping, nor the improved readability that piping provides, as demonstrated below:

library(tidyverse)
library(knitr)
library(kableExtra)

data(diamonds)

# tidyverse

diamonds %>%
  group_by(cut) %>%
  summarise(Q1 = round(quantile(price, 1/4), 2),
            Median = round(median(price), 2),
            Mean = round(mean(price), 2),
            Q3 = round(quantile(price, 3/4), 2),
            Max = round(max(price), 2)) %>%
    kable(format = "html", format.args = list(big.mark = ','),
          col.names = c("Cut", "Q1", "Median", "Mean", "Q3", "Max")) %>%
    kable_styling(full_width = FALSE, position = "left")


# base R - there are several ways to do this; this is a shorter one

diamonds_split <- split(diamonds, f = list(diamonds$cut))

result <- do.call(rbind, lapply(diamonds_split, function(x) {
    data.frame(Q1 = round(quantile(x$price, 1/4), 2),
               Median = round(median(x$price), 2),
               Mean = round(mean(x$price), 2),
               Q3 = round(quantile(x$price, 3/4), 2),
               Max = round(max(x$price), 2))
    }
  )
)

result <- data.frame(cut = row.names(result), result)

k1 <- kable(result, format = "html", format.args = list(big.mark = ','),
          col.names = c("Cut", "Q1", "Median", "Mean", "Q3", "Max"))

kable_styling(k1, full_width = FALSE, position = "left")

As you can see, the base R approach requires a deeper understanding of functions, the ability to use a less clear syntax, and the need to keep assigning rather than piping.

You've never heard of tapply()?

I haven’t heard of tapply() and also fail to see how it is a relevant response, given that you don’t seem to use it in the example referenced above. Maybe providing a counterexample would be more pedagogically useful than responding with a question about an obscure function?

The issue that you used a trivial example and did not represent piping well still holds. Instead of addressing the issue you're trying to make me feel bad for not using a different approach. For what it's worth, I used tapply() before I had even heard of the tidyverse, and I clearly stated that there are multiple solutions in base R.

Here is a link to many examples of base R and tidyverse code comparisons, in general you can see that the code is more readable and pipes are useful (which becomes even more apparent when combining several functions to clean a dataset): https://tavareshugo.github.io/data_carpentry_extras/base-r_tidyverse_equivalents/base-r_tidyverse_equivalents.html

Also, you state that debugging is harder with pipes - this is not true since you can easily run smaller parts of piped code.

As I see it, @matloff was only addressing the extensive use of pipes rather than comparing the whole tidyverse vs base R (in this particular example at least), while you provided a specific example that involves grouping operations, which, I think most of us will agree, is not base R's strongest side.
This is (probably) one of the reasons data.table was created too. I think your concern about "simplicity" should be addressed to the above comparison of data.table vs dplyr.

On the other hand, if we stick with dplyr (tidyverse) vs. base R, we could also bring up many examples where base R is much simpler than the dplyr idiom; you just conveniently picked one that matches the point you are trying to make, @ljanda

@DavidArenburg,
It would be useful to provide examples to support your case rather than speaking in generalities. You claim that there are many examples where base R functions are much simpler than the dplyr idiom, but provide no examples of such cases or any description of what qualifies something as simpler from your perspective. Essentially, you are picking convenient phrases with overly general terms to support a non-falsifiable claim you seek to advance. It also isn’t clear which above comparison between data.table and dplyr you are referencing as you seem to be the only one who has referenced data.table in this issue; perhaps provide a reference to the other comment/thread/issue to which you may have been referring?

@DavidArenburg my example doesn't just include grouping operations - it also has the kable and kableExtra styling to render a nice table, showing that you can pipe the output of the grouping into the table-styling functions, whereas without pipes you have to stop and assign multiple times. My point was that @matloff used a trivial example rather than something meatier that actually shows a difference between the tidyverse and base R. I could have given even more complicated examples that use the full suite of dplyr functions and piping (e.g. selecting a few variables, mutating them, grouping, then summarizing, without having to stop to assign once), but here I gave a fairly simple one.
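For illustration, a rough sketch of what that kind of longer pipeline could look like (the variables and derived column here are arbitrary, chosen just to show the shape of the code):

library(tidyverse)

# select a few variables, derive a new one, group, then summarise,
# all without a single intermediate assignment
diamonds %>%
  select(cut, carat, price) %>%
  mutate(price_per_carat = price / carat) %>%
  group_by(cut) %>%
  summarise(mean_price_per_carat = round(mean(price_per_carat), 2))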

@ljanda: Your point about debugging is exactly what I am saying: It's better to break things up as in my example.

@ljanda I don't see anything special about making intermediate assignments. And I don't think pipes are really related to the tidyverse anyway; you can pipe base R and data.table too if you really want to. Nor do I think (my own opinion) that pipes are bad; sometimes they are even useful. But after spending about 5 years seeing all kinds of questions and answers on StackOverflow, I see that, in general, pipes are abused by tidyverse users all the time.

For instance, I find this absolutely ridiculous.
I mean dataframe %>% select(text) %>% unlist() %>% .[4]? Seriously? Just dataframe[4, "text"] is not cool anymore? I see this nonsense all over the place.
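To spell it out with a toy data frame (the names here are made up just for illustration), both lines pull out the same value:

library(dplyr)

dataframe <- data.frame(text = letters[1:5], stringsAsFactors = FALSE)

dataframe %>% select(text) %>% unlist() %>% .[4]   # the piped version
dataframe[4, "text"]                               # plain base R indexing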

@ljanda, thanks for the Tavares reference. The example is indeed one in which tapply is much clearer, more compact and more straightforward. I've added it to my essay.
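For readers who haven't seen it, here is one possible sketch of a tapply() version of the per-cut price summary (this is just one way to write it, not necessarily the code that went into the essay; it assumes the diamonds data from ggplot2):

library(ggplot2)   # for the diamonds data

stats <- function(x) round(c(Q1 = unname(quantile(x, 1/4)), Median = median(x),
                             Mean = mean(x), Q3 = unname(quantile(x, 3/4)),
                             Max = max(x)), 2)

do.call(rbind, tapply(diamonds$price, diamonds$cut, stats))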

With debugging you are breaking things up regardless of whether you're running part of a pipe or parts of unpiped code.

You don't have to revert to base R, you can just run part of the pipe.
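For example (a rough sketch, reusing the diamonds pipeline from above), you can highlight and run just the first steps and inspect the intermediate result before adding the table-styling steps:

library(tidyverse)

diamonds %>%
  group_by(cut) %>%
  summarise(Median = round(median(price), 2))
# once this looks right, append the kable() and kable_styling() steps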

@DavidArenburg
There definitely can be differences between pipes and multiple assignments. Pipes are an abstraction layer over the syntax, while multiple assignment means consuming more memory to store additional objects which may or may not need to be kept (temporarily or otherwise). I agree that the example you posted is completely ridiculous, without a doubt. However, I've also encountered cases where analysts will create several different copies of essentially the same object or will repeatedly overwrite the same object multiple times:

masterDF <- merge(Student_Teacher_Link, Student_Attrib, by = c("STUDENT_ID"))
masterDF <- merge(masterDF, Core_Courses,  by = c("TID", "CID", "SCHOOL_YEAR", "SCHOOL_NAME", "SCHOOL_CODE"))
masterDF <- merge(masterDF, Student_Sch_Yr, by = c("STUDENT_ID", "SCHOOL_YEAR"))
names(masterDF)[names(masterDF) == "S_GRADE_CODE"] <- "GRADE_CODE"

################################ Subset Elementary teachers...they are in both math and reading data sets #############
EL <- masterDF[which(masterDF$GRADE_CODE == "01" | masterDF$GRADE_CODE == "02" |
                       masterDF$GRADE_CODE == "03" | masterDF$GRADE_CODE == "04" |
                       masterDF$GRADE_CODE == "05"), ]

######################################## Subset teachers with math/ELA (reading) indicators ########################
forMath <- masterDF[which(masterDF$MATH == "1" & masterDF$CORE == "Yes"), ]
forRead <- masterDF[which(masterDF$ELA == "1" & masterDF$CORE == "Yes"), ]

############################### Stack math/reading with elementary teachers remove dups ##########################
forMath <- rbind(EL, forMath)
forRead <- rbind(EL, forRead)

forRead <- merge(forRead, Student_Scores, by = c("STUDENT_ID", "SCHOOL_YEAR"))
forMath <- merge(forMath, Student_Scores, by = c("STUDENT_ID", "SCHOOL_YEAR"))

That's an example from someone I work with. It isn't representative of the population, but it also highlights an issue with users who aren't terribly versed in programming.

@wbuchanan this is a great example and a really good point about multiple assignment consuming more memory
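For instance (just a rough illustration with arbitrary objects), object.size() shows what each stored intermediate costs:

x <- rnorm(1e6)
a <- log(x)       # first intermediate assignment
b <- sqrt(a)      # second intermediate assignment
print(object.size(a), units = "MB")   # each copy sits in memory until removed
print(object.size(b), units = "MB")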
@DavidArenburg I agree the example you gave is not a great use of pipes and that they can be abused but the pipe is part of the magrittr package, which is technically part of the tidyverse
Also, I think it is worth pointing out that people often build up their pipes - I usually add line by line and check outcomes along the way (which actually helps with debugging)

@ljanda magrittr wasn't originally part of the tidyverse; it was basically contributed to it: tidyverse/magrittr@cf2e33f

Regarding your debugging strategy, it basically means that you need to rerun your whole code over and over after adding each line, which will probably be time/memory consuming.

@wbuchanan I think in your example it is better to persist each step like your co-worker did instead of piping it all together, which would probably hit an out-of-memory error.

Also, if you work with data.table, each merge can update the data in place, which saves both time and memory and avoids piping.
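As a rough sketch of what I mean (the table and column names here are made up, not taken from the code above), an update join adds a column to the existing table by reference instead of creating a new merged copy:

library(data.table)

scores  <- data.table(STUDENT_ID = 1:3, SCORE = c(90, 80, 70))
schools <- data.table(STUDENT_ID = 1:3, SCHOOL_NAME = c("A", "B", "A"))

# add SCHOOL_NAME to scores by reference; no new copy of scores is created
scores[schools, SCHOOL_NAME := i.SCHOOL_NAME, on = "STUDENT_ID"]
scores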

Finally, if someone piped all of these joins and then wanted to pipe additional steps, they would need to rerun all of the joins each time they added a step, which would be a time/memory mess.

All in all (if we ignore code cleanliness), piping would probably make it worse (in my opinion at least).

Feels like this discussion is missing a few things...

First, even the example by @ljanda is quite simplistic and can be achieved with base R in an easier way:

# base
result <- aggregate(price ~ cut, data=diamonds, FUN=function(x) round(summary(x)[-1],2))
result <- kable(result, format = "html", format.args = list(big.mark = ','),
                col.names = c("Cut", "Q1", "Median", "Mean", "Q3", "Max"))
kable_styling(result, full_width = FALSE, position = "left")

Second, if pipe-like syntax (left to right) is more readable, this too can be achieved within base R:

aggregate(price ~ cut, data=diamonds, FUN=function(x) round(summary(x)[-1], 2)) ->.
kable(., format = "html", format.args = list(big.mark = ','),
      col.names = c("Cut", "Q1", "Median", "Mean", "Q3", "Max")) ->.
kable_styling(., full_width = FALSE, position = "left")

This would also allow you to stop in the middle of the "pipeline" and continue from where you left off.
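Concretely (a small sketch, assuming the knitr and kableExtra packages loaded earlier are still attached):

aggregate(price ~ cut, data = diamonds, FUN = median) ->.
str(.)                     # stop here and inspect; . still holds the aggregate result
kable(., format = "html") ->.
kable_styling(., full_width = FALSE, position = "left")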

So the way I see it, the discussion about the advantages and disadvantages of the pipe should compare it against this style of syntax instead, especially because all the advantages proposed so far seem to be only about readability.

@KKPMW
Thanks for the alternate example. The only problem I would see is overwriting the value of ., but since R uses fairly different operators for method calls on objects, it may not be terrible.
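To make that caveat concrete (a tiny illustration): . is just an ordinary binding in the global environment, so each ->. step overwrites whatever . held before:

. <- "something I cared about"
1:10 ->.        # . now holds 1:10; the old value is gone
sum(.) ->.      # . now holds 55
.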

That said, there is still some potential overhead difference from reassigning values to the existing object in memory. While I don’t agree with everything @DavidArenburg mentioned above, I do agree that there are definitely cases where data.table is the right solution. What I’m less certain about is whether the same memory benefit is achieved if the object grows in memory consumption along the way. For example, if the data set were arbitrarily small and the aggregation result were several times larger (say, something analogous to a multidimensional cube in the world of relational databases), would it still perform as well, or would it run into memory corruption issues or the overhead of reallocating memory, since the existing pointers would no longer provide access to the necessary amount of memory?

@wbuchanan

That said, there is still some potential overhead difference from reassigning values to the existing object in memory.

Based on a few benchmarks, I am seeing that with small objects the ->. assignment is a lot faster than the pipe, and when the object size is large they converge to the same speed.

Small object:

library(magrittr)         # for %>%
library(microbenchmark)

x <- 1:10
microbenchmark(pipe={x %>% log %>% sqrt}, base={x ->.; log(.) ->.; sqrt(.)}, times=1000)

Unit: nanoseconds
 expr   min      lq      mean  median    uq    max neval
 pipe 49933 52920.0 57798.409 55523.5 61429 167158  1000
 base   572   708.5   929.804   834.0   946  55760  1000

Large object:

x <- matrix(abs(rnorm(1000000*100)), ncol=100)
microbenchmark(pipe={x %>% log %>% sqrt}, base={x ->.; log(.) ->.; sqrt(.)}, times=10)

Unit: seconds
 expr      min       lq     mean   median       uq      max neval
 pipe 2.003351 2.033280 2.057402 2.047359 2.081823 2.125832    10
 base 1.983885 2.016597 2.065985 2.065186 2.102157 2.143859    10

Of course I haven't tested this thoroughly. But a few advantages of ->. that come to mind are: 1) no dependencies; 2) it's easier to "get" what is going on behind the scenes; 3) you can stop at any step, inspect the result in ., and then continue with the next step without recomputing the whole pipeline; 4) it's faster (probably).

@KKPMW try benchmarking with bench::mark() or bench::press(), as they also measure memory allocation.

@KKPMW
I think it is just as easy to step through the code regardless of the convention being used, but definitely interesting to see the differences in performance.

@DavidArenburg

I tried bench::mark(), and memory allocation only differed for very small objects (in favour of ->.).

library(bench)            # for mark()
library(magrittr)         # for %>%

# small object
x <- rnorm(20)+100
mark(pipe = {x %>% log %>% head(10) %>% sqrt},
     base = {x ->.; log(.) ->.; head(., 10) ->.; sqrt(.)},
     iterations=10)

# A tibble: 2 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 pipe       124.29µs    142µs     6389.      488B        0    10     0
2 base         9.35µs   10.6µs    80682.      208B        0    10     0



# larger object
x <- rnorm(1000000)+100
mark(pipe = {x %>% log %>% head(10) %>% sqrt},
     base = {x ->.; log(.) ->.; head(., 10) ->.; sqrt(.)},
     iterations=10)

# A tibble: 2 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 pipe         7.93ms   8.63ms      113.    7.63MB        0    10     0
2 base         7.66ms   8.29ms      120.    7.63MB        0    10     0

@ljanda:
Had you considered using the switch() function in your base R example, it would have worked against your argument.
Better yet (and leaving knitr::kable() out because it is outside the scope of this discussion):

> require(data.table) 

> dt = as.data.table(diamonds)

> unique(
+   dt[, c('Median', 'Mean', 'Q3', 'Max') := .(median(price), mean(price), quantile(price, 3/4), max(price)), keyby = cut],
+   by = 'cut')

    carat       cut color clarity depth table price    x    y    z Median     Mean      Q3   Max
1:  0.22      Fair     E     VS2  65.1    61   337 3.87 3.78 2.49   3282 4358.758 5205.50 18574
2:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31   3050 3928.864 5028.00 18788
3:  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48   2648 3981.760 5372.75 18818
4:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31   3185 4584.258 6296.00 18823
5:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43   1810 3457.542 4678.50 18806

No group_by(), no summarize(), no round(). A simple line of code that clearly states the intent and is efficient.

Even simpler (and still no piping!):

> dt[, .(Median = median(as.numeric(price)), Mean = mean(price), Q3 = quantile(price, 3/4), Max = max(price)), keyby = cut]

         cut Median     Mean      Q3   Max
1:      Fair 3282.0 4358.758 5205.50 18574
2:      Good 3050.5 3928.864 5028.00 18788
3: Very Good 2648.0 3981.760 5372.75 18818
4:   Premium 3185.0 4584.258 6296.00 18823
5:     Ideal 1810.0 3457.542 4678.50 18806