rray_summarise()
The tapply() function is the obvious way to produce arrays from data frames.
But dplyr users have other aggregation tools that keep them within the tidy data format.
It might be useful to ease their occasional jumps into array computing by offering a kind of tapply() tailored to their conventions.
If you dare to infringe Hadley Wickham's function-name copyright, a simple example of this could be:
library(rray)
library(dplyr)
library(gapminder)
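rray_summarise <- function(grtib, exp = 1, FUN = sum, ...) {
  # Capture the expression to aggregate without evaluating it
  e <- rlang::enexpr(exp)
  # Evaluate it row by row; transmute() keeps the grouping columns
  tib2 <- dplyr::transmute(grtib, `*var*` = !!e)
  # Cross the grouping columns with tapply() and wrap the array as an rray;
  # group_vars() replaces the internal "vars" attribute lookup
  as_rray(tapply(tib2$`*var*`,
                 INDEX = as.list(tib2[dplyr::group_vars(tib2)]),
                 FUN = FUN, ...))
}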
gapminder %>% group_by(continent, year) %>% rray_summarise(pop/1000)
#> <rray<dbl>[,12][60]>
#> year
#> continent 1952 1957 1962 1967 1972 1977
#> Africa 237640.50 264837.74 296516.86 335289.49 379879.5 433061.0
#> Americas 345152.45 386953.92 433270.25 480746.62 529384.2 578067.7
#> Asia 1395357.35 1562780.60 1696357.18 1905662.90 2150972.2 2384513.6
#> Europe 418120.85 437890.35 460355.15 481178.96 500635.1 517164.5
#> Oceania 10686.01 11941.98 13283.52 14600.41 16106.1 17239.0
#> year
#> continent 1982 1987 1992 1997 2002
#> Africa 499348.59 574834.11 659081.52 743832.98 833723.92
#> Americas 630290.92 682753.97 739274.10 796900.41 849772.76
#> Asia 2610135.58 2871220.76 3133292.19 3383285.50 3601802.20
#> Europe 531266.90 543094.16 558142.80 568944.15 578223.87
#> Oceania 18394.85 19574.42 20919.65 22241.43 23454.83
#> year
#> continent 2007
#> Africa 929539.69
#> Americas 898871.18
#> Asia 3811953.83
#> Europe 586098.53
#> Oceania 24549.95
Created on 2019-06-20 by the reprex package (v0.2.1)
The main difference between dplyr's group_by() + summarise() and tapply() (and thus a refined rray_summarise()) lies in which groups are considered. With dplyr, groups are formed from the data, so only the combinations actually present in the data are returned. With tapply(), "a priori" classifications are prescribed as factor variables, and the result is an exhaustive crossing of their levels, no matter what the data set actually contains. Not only individual cells but even entire rows with no data appear in the result, as long as their factor level was prescribed. The order of the levels is kept as well.
This predictable result seems preferable when the production of aggregates is automated.
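A minimal sketch of that difference, assuming a made-up toy data frame (the names toy, g, h and v are hypothetical and only serve the illustration). The factor levels are prescribed in advance, but level "c" of g and level "z" of h never occur in the data:

```r
library(dplyr)

# Hypothetical toy data: levels are prescribed, but "c" and "z" never occur
toy <- data.frame(
  g = factor(c("a", "a", "b"), levels = c("c", "b", "a")),
  h = factor(c("x", "y", "y"), levels = c("x", "y", "z")),
  v = c(1, 2, 3)
)

# dplyr: one row per combination actually present in the data
by_data <- toy %>%
  group_by(g, h) %>%
  summarise(total = sum(v))
nrow(by_data)
#> [1] 3

# tapply(): full crossing of the prescribed levels, empty cells are NA
by_levels <- tapply(toy$v, as.list(toy[c("g", "h")]), sum)
dim(by_levels)
#> [1] 3 3

# The prescribed level order is kept, including the unused level "c"
dimnames(by_levels)$g
#> [1] "c" "b" "a"
```

The array always has the same 3 x 3 shape regardless of which rows happen to be in toy, which is the predictability referred to above.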
This is an obvious clarification, but I think it is important here as another justification (besides the ability to operate on aggregates of different granularities thanks to rray broadcasting, of course) of why functionality like this complements what dplyr offers now.
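To illustrate what operating on aggregates of different granularities means, here is a rough sketch in base R only. It does not use rray: sweep() stands in for the broadcasting step, and the table and its numbers are hypothetical.

```r
# Hypothetical fine-grained aggregate: totals by continent and year
tab <- tapply(
  c(10, 30, 20, 40),
  list(continent = c("A", "A", "B", "B"),
       year      = c("2000", "2005", "2000", "2005")),
  sum
)

# Coarser aggregate: one total per continent
totals <- rowSums(tab)

# Combine both granularities: share of each year within its continent.
# sweep() recycles the per-continent totals along the rows (margin 1),
# which is the role broadcasting would play with rray objects.
shares <- sweep(tab, 1, totals, "/")
shares["A", "2005"]
#> [1] 0.75
```

Because both aggregates come from an exhaustive crossing of the same levels, their dimensions line up without any join.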
My rray_summarise() function, based on the current dplyr::group_by(), doesn't fully address this difference in grouping.
(In terms of dplyr issue #4392, I am solving the "expand" part.)
But in view of tidyverse/dplyr#4392 (comment), it could change to something completely different.