duckdblabs/duckplyr

Inconsistent reuse of result in `summarize()`

Closed this issue · 1 comments

@hadley: reusing summary variables may be a mistake, more often than not. This is not handled in duckplyr yet. Should we implement the dangerous behavior or find a better way in dplyr? We could require the reuse of a computed summary variable to use an adverb, for example:

options(conflicts.policy = list(warn = FALSE))
library(duckplyr)

data <- tibble(a = c(1L, 1:2), b = 1:3, c = 4:6)
data
#> # A tibble: 3 × 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1     1     4
#> 2     1     2     5
#> 3     2     3     6

data |>
  summarize(
    .by = a,
    b = sum(b),
    c = sum(b * c),
  )
#> # A tibble: 2 × 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1     3    27
#> 2     2     3    18

data |>
  as_duckplyr_df() |>
  summarize(
    .by = a,
    b = sum(b),
    c = sum(b * c),
  )
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Aggregate [a, sum(b), sum(*(b, c))]
#>   r_dataframe_scan(0x12cbe0ec8)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (INTEGER)
#> - b (HUGEINT)
#> - c (HUGEINT)
#> 
#> # A tibble: 2 × 3
#>       a     b     c
#>   <int> <dbl> <dbl>
#> 1     2     3    18
#> 2     1     3    14

sum(1:2 * 4:5)
#> [1] 14
sum(sum(1:2) * 4:5)
#> [1] 27

Created on 2023-11-09 with reprex v2.0.2

I don't think we want to make any big design changes based on this unusual edge case. I would suggest duckdplyr warns when it encounters this situation and then we can use information from that to determine if we can pull it from dplyr.