Inconsistent reuse of result in `summarize()`
Closed this issue · 1 comments
krlmlr commented
@hadley: reusing summary variables may be a mistake, more often than not. This is not handled in duckplyr yet. Should we implement the dangerous behavior or find a better way in dplyr? We could require the reuse of a computed summary variable to use an adverb, for example:
options(conflicts.policy = list(warn = FALSE))
library(duckplyr)
data <- tibble(a = c(1L, 1:2), b = 1:3, c = 4:6)
data
#> # A tibble: 3 × 3
#> a b c
#> <int> <int> <int>
#> 1 1 1 4
#> 2 1 2 5
#> 3 2 3 6
data |>
summarize(
.by = a,
b = sum(b),
c = sum(b * c),
)
#> # A tibble: 2 × 3
#> a b c
#> <int> <int> <int>
#> 1 1 3 27
#> 2 2 3 18
data |>
as_duckplyr_df() |>
summarize(
.by = a,
b = sum(b),
c = sum(b * c),
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Aggregate [a, sum(b), sum(*(b, c))]
#> r_dataframe_scan(0x12cbe0ec8)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - a (INTEGER)
#> - b (HUGEINT)
#> - c (HUGEINT)
#>
#> # A tibble: 2 × 3
#> a b c
#> <int> <dbl> <dbl>
#> 1 2 3 18
#> 2 1 3 14
sum(1:2 * 4:5)
#> [1] 14
sum(sum(1:2) * 4:5)
#> [1] 27
Created on 2023-11-09 with reprex v2.0.2
hadley commented
I don't think we want to make any big design changes based on this unusual edge case. I would suggest duckdplyr warns when it encounters this situation and then we can use information from that to determine if we can pull it from dplyr.