SebKrantz/collapse

`.by` in fmutate and fsummarize arguments?

kylebutts opened this issue · 1 comment

Hi Sebastian,

One new thing dplyr has added that I love is the ability to pass .by to mutate/summarize/filter (skipping group_by() and ungroup()). Could this be added to the f* functions such as fmutate() and fsummarise()?

For example, with masking enabled, the same code currently gives a different result: .by is treated as just another summary column rather than as a grouping.

library(tidyverse)
mtcars |>
  subset(mpg > 11) |>
  summarise(
    across(c(mpg, carb, hp), mean),
    qsec_wt = weighted.mean(qsec, wt),
    .by = c("cyl", "vs", "am")
  )
#>   cyl vs am      mpg     carb        hp  qsec_wt
#> 1   6  0  1 20.56667 4.666667 131.66667 16.33306
#> 2   4  1  1 28.37143 1.428571  80.57143 18.75509
#> 3   6  1  0 19.12500 2.500000 115.25000 19.21275
#> 4   8  0  0 15.98000 2.900000 191.00000 17.01239
#> 5   4  1  0 22.90000 1.666667  84.66667 21.04028
#> 6   4  0  1 26.00000 2.000000  91.00000 16.70000
#> 7   8  0  1 15.40000 6.000000 299.50000 14.55297


library(collapse)
#> collapse 2.0.9, see ?`collapse-package` or ?`collapse-documentation`
set_collapse(mask = "manip")

mtcars |>
  subset(mpg > 11) |>
  summarise(
    across(c(mpg, carb, hp), mean),
    qsec_wt = weighted.mean(qsec, wt),
    .by = c("cyl", "vs", "am")
  )
#>        mpg     carb       hp  qsec_wt .by
#> 1 20.73667 2.733333 142.4667 17.74035 cyl
#> 2 20.73667 2.733333 142.4667 17.74035  vs
#> 3 20.73667 2.733333 142.4667 17.74035  am

Created on 2024-02-22 with reprex v2.1.0

Hi, I understand the impulse, but I don't think this is very useful to collapse. In fsummarise() there is no regrouping, and fgroup_by() does not do more than required, so when using the Fast Statistical Functions,

mtcars |>
  subset(mpg > 11) |>
  group_by(cyl, vs, am) |>
  summarise(
    across(c(mpg, carb, hp), fmean),
    qsec_wt = fmean(qsec, w = wt)  # weights passed by name
  )

is as efficient as a .by solution would be. There is also collap(~ cyl + vs + am, w = ~wt, custom = list(fmean_uw = .c(mpg, carb, hp), fmean = "qsec"), keep.col.order = FALSE). With fmutate(), you have a generalization of the .by behavior through the g argument of the Fast Statistical Functions, e.g.

mtcars |> 
  mutate(across(c(mpg, carb, hp), fmean, list(cyl, vs, am), TRA = "fill"))  
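To illustrate what the g and TRA arguments do in the call above, here is a minimal sketch on a single column, checked against base R's ave(). The object names mpg_gmean and mpg_ref are only illustrative, and the example assumes no missing values (true for mtcars).

```r
library(collapse)

# Group means of mpg by cyl/vs/am, returned at full length in one call:
# g supplies the grouping columns, TRA = "fill" replaces every value
# with the statistic of its group.
mpg_gmean <- fmean(mtcars$mpg, g = mtcars[c("cyl", "vs", "am")], TRA = "fill")

# Base R reference: ave() applies mean() within each group combination
mpg_ref <- ave(mtcars$mpg, mtcars$cyl, mtcars$vs, mtcars$am)

all.equal(mpg_gmean, mpg_ref)
```

The result has one value per row of mtcars, so it can be assigned as a new column without any group_by()/ungroup() step.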

This of course makes it repetitive to compute multiple expressions with the same grouping (and in that case fgroup_by() and fungroup() would make sense), but on the other hand you can use different groupings in the same fmutate()/ftransform() call, or even in a single expression. For example, observing country-sector level trade, you can compute revealed comparative advantage on one line:

exports = data.frame(c = rep(1:10, each = 10), 
                     s = rep(1:10, 10),
                     v = abs(rnorm(100))) 
exports |> 
  mutate(rca = fsum(v, c, TRA = "/") / fsum(v, s, TRA = "/"))
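As a rough sketch of what the two factors in that expression compute, each fsum(..., TRA = "/") call divides v by the total of its own group, so one factor is a share within country c and the other a share within sector s. The seed and the names share_c, share_s, ref_c and ref_s are assumptions added here only to make the check reproducible.

```r
library(collapse)

set.seed(1)  # assumption: fixed seed, only for reproducibility
exports <- data.frame(c = rep(1:10, each = 10),
                      s = rep(1:10, 10),
                      v = abs(rnorm(100)))

# v divided by the total of its country (grouping by c)
share_c <- fsum(exports$v, exports$c, TRA = "/")
# v divided by the total of its sector (grouping by s)
share_s <- fsum(exports$v, exports$s, TRA = "/")

# Base R equivalents using ave() with sum()
ref_c <- exports$v / ave(exports$v, exports$c, FUN = sum)
ref_s <- exports$v / ave(exports$v, exports$s, FUN = sum)

all.equal(share_c, ref_c)
all.equal(share_s, ref_s)
```

Because each call carries its own grouping, the two groupings can be mixed in one expression without grouping and ungrouping the data frame twice.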

Thus I think the absence of regrouping in fsummarise(), the availability of collap(), and the incorporation of grouping and transformations (including transformation by reference using set = TRUE) into Fast Statistical Functions make this feature redundant in collapse.
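For completeness, a self-contained sketch of the grouped pipeline without masking, using the f-prefixed names directly and checking the weighted mean against base R. The names d, res and ref are illustrative. Note that fgroup_by() sorts groups by default, so the row order differs from the dplyr .by output, which keeps groups in order of first appearance.

```r
library(collapse)

d <- subset(mtcars, mpg > 11)

# Group once, then summarise with Fast Statistical Functions
res <- d |>
  fgroup_by(cyl, vs, am) |>
  fsummarise(
    across(c(mpg, carb, hp), fmean),
    qsec_wt = fmean(qsec, w = wt)
  )

# Base R reference for the weighted mean in each non-empty group
ref <- sapply(
  split(d, list(d$cyl, d$vs, d$am), drop = TRUE),
  function(g) weighted.mean(g$qsec, g$wt)
)

# Compare sorted values, since the group order differs between the two
all.equal(sort(res$qsec_wt), sort(unname(ref)))
```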