Cannot recover data frame inside plan after dynamic combining
Closed this issue · 2 comments
Description
After using combine on a data frame, there is no documented method to recover the combined data frame from within the plan.
Reproducible example
library(drake)
library(dplyr)
compute_max_x <- function(df) {
  mutate(df, max_x = max(x))
}
df <- tibble(
  g = rep(c("a", "b", "c"), length.out = 10),
  x = runif(10)
)
plan <- drake_plan(
  group = df$g,
  df_group_max_splits = target(
    compute_max_x(df),
    dynamic = combine(df, .by = group)
  ),
  df_group_max_combined_list = target(
    bind_rows(df_group_max_splits),
    dynamic = combine(df_group_max_splits)
  ),
  df_overall_max = compute_max_x(df_group_max_combined_list[[1]])
)
make(plan)
Expected result
The data frame is split according to the grouping variable, and the calculation is performed correctly on each split (aside: using a function called combine to perform a splitting operation is counterintuitive):
> readd(df_group_max_splits)
[[1]]
# A tibble: 4 x 3
g x max_x
<chr> <dbl> <dbl>
1 a 0.451 0.727
2 a 0.267 0.727
3 a 0.727 0.727
4 a 0.384 0.727
[[2]]
# A tibble: 3 x 3
g x max_x
<chr> <dbl> <dbl>
1 b 0.749 0.967
2 b 0.548 0.967
3 b 0.967 0.967
[[3]]
# A tibble: 3 x 3
g x max_x
<chr> <dbl> <dbl>
1 c 0.392 0.412
2 c 0.101 0.412
3 c 0.412 0.412
Calling combine with no .by argument combines the results as per the docs (aside 2: as I mentioned in another comment, one expects that calling bind_rows on a list of data frames would return a data frame, as it would outside a drake plan):
> readd(df_group_max_combined_list)
[[1]]
# A tibble: 10 x 3
g x max_x
<chr> <dbl> <dbl>
1 a 0.451 0.727
2 a 0.267 0.727
3 a 0.727 0.727
4 a 0.384 0.727
5 b 0.749 0.967
6 b 0.548 0.967
7 b 0.967 0.967
8 c 0.392 0.412
9 c 0.101 0.412
10 c 0.412 0.412
The df_overall_max target should extract the data frame inside the single-item list and call the function on the entire combined data frame:
> readd(df_overall_max)
# A tibble: 10 x 3
g x max_x
<chr> <dbl> <dbl>
1 a 0.451 0.967
2 a 0.267 0.967
3 a 0.727 0.967
4 a 0.384 0.967
5 b 0.749 0.967
6 b 0.548 0.967
7 b 0.967 0.967
8 c 0.392 0.967
9 c 0.101 0.967
10 c 0.412 0.967
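For comparison, here is a minimal sketch of the same computation outside a drake plan, using base split() and dplyr. It only illustrates the expected shapes at each step; the seed and the names splits, combined, and overall are illustrative assumptions and do not appear in the plan above.

```r
library(dplyr)

compute_max_x <- function(df) {
  mutate(df, max_x = max(x))
}

set.seed(1)  # hypothetical seed, only so the sketch is reproducible
df <- tibble(
  g = rep(c("a", "b", "c"), length.out = 10),
  x = runif(10)
)

# Split the rows by group and compute the per-group maximum on each piece.
splits <- lapply(split(df, df$g), compute_max_x)

# bind_rows() on a list of data frames returns one data frame, not a list.
combined <- bind_rows(splits)

# Recomputing on the combined data frame overwrites max_x with the overall maximum.
overall <- compute_max_x(combined)
```

Here combined is a single 10-row data frame, so no [[1]] extraction is needed before calling compute_max_x() on it.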
Actual result
fail df_overall_max
Error: Target `df_overall_max` failed. Call `diagnose(df_overall_max)` for details. Error message:
no applicable method for 'mutate_' applied to an object of class "character"
If I add df2_class = class(df_group_max_combined_list) as a target and readd it, I get "drake_dynamic" back instead of a list.
I've read through the dynamic branching chapter in the book and can't find a way to continue working with a target after doing the combine .by + combine operation. The examples end with calling readd on the target containing the single-item list, so I'm not sure whether this is supported.
On the other hand, I might very well be misunderstanding how combine is supposed to work, in which case this wouldn't be a bug but rather a need to expand the docs.
(Still, I'm very excited about the dynamic branching feature and hope to make greater use of it soon!)
Session info
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] drake_7.7.0.9002 dplyr_0.8.1
loaded via a namespace (and not attached):
[1] igraph_1.2.4.1 Rcpp_1.0.1 magrittr_1.5 tidyselect_0.2.5 R6_2.4.0 rlang_0.3.4 fansi_0.4.0 storr_1.2.1 tools_3.6.0 utf8_1.1.4 cli_1.1.0 base64url_1.4 assertthat_0.2.1 digest_0.6.19
[15] tibble_2.1.3 crayon_1.3.4 txtq_0.1.4 purrr_0.2.4 vctrs_0.1.0 zeallot_0.1.0 glue_1.3.1 compiler_3.6.0 pillar_1.4.1 filelock_1.0.2 backports_1.1.4 renv_0.8.2 pkgconfig_2.0.2
So glad to see these dynamic branching issues coming in so soon.
Things work smoothest if all targets downstream of dynamic targets are also dynamic. Another map() call is useful, even if there is only one sub-target.
library(drake)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
compute_max_x <- function(df) {
  mutate(df, max_x = max(x))
}
df <- tibble(
  g = rep(c("a", "b", "c"), length.out = 10),
  x = runif(10)
)
plan <- drake_plan(
  group = df$g,
  df_group_max_splits = target(
    compute_max_x(df),
    dynamic = combine(df, .by = group)
  ),
  df_group_max_combined_list = target(
    bind_rows(df_group_max_splits),
    dynamic = combine(df_group_max_splits)
  ),
  df_overall_max = target(
    compute_max_x(df_group_max_combined_list),
    dynamic = map(df_group_max_combined_list)
  )
)
make(plan)
#> target group
#> dynamic df_group_max_splits
#> subtarget df_group_max_splits_5319b5d3
#> subtarget df_group_max_splits_7f723e65
#> subtarget df_group_max_splits_2e487914
#> aggregate df_group_max_splits
#> dynamic df_group_max_combined_list
#> subtarget df_group_max_combined_list_a98b3360
#> aggregate df_group_max_combined_list
#> dynamic df_overall_max
#> subtarget df_overall_max_5077f7a9
#> aggregate df_overall_max
readd(df_group_max_combined_list)
#> [[1]]
#> # A tibble: 10 x 3
#> g x max_x
#> <chr> <dbl> <dbl>
#> 1 a 0.617 0.617
#> 2 a 0.0521 0.617
#> 3 a 0.381 0.617
#> 4 a 0.566 0.617
#> 5 b 0.382 0.725
#> 6 b 0.399 0.725
#> 7 b 0.725 0.725
#> 8 c 0.600 0.600
#> 9 c 0.114 0.600
#> 10 c 0.375 0.600
readd(df_overall_max)
#> [[1]]
#> # A tibble: 10 x 3
#> g x max_x
#> <chr> <dbl> <dbl>
#> 1 a 0.617 0.725
#> 2 a 0.0521 0.725
#> 3 a 0.381 0.725
#> 4 a 0.566 0.725
#> 5 b 0.382 0.725
#> 6 b 0.399 0.725
#> 7 b 0.725 0.725
#> 8 c 0.600 0.725
#> 9 c 0.114 0.725
#> 10 c 0.375 0.725
Created on 2019-11-13 by the reprex package (v0.3.0)
Alternatively, readd() pulls dynamic sub-targets back into the land of static branching. The new subtargets argument is useful if you do not want to load them all.
library(drake)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
compute_max_x <- function(df) {
  mutate(df, max_x = max(x))
}
df <- tibble(
  g = rep(c("a", "b", "c"), length.out = 10),
  x = runif(10)
)
plan <- drake_plan(
  group = df$g,
  df_group_max_splits = target(
    compute_max_x(df),
    dynamic = combine(df, .by = group)
  ),
  df_group_max_combined_list = target(
    bind_rows(df_group_max_splits),
    dynamic = combine(df_group_max_splits)
  ),
  df_overall_max = compute_max_x(readd(df_group_max_combined_list)[[1]])
)
make(plan)
#> target group
#> dynamic df_group_max_splits
#> subtarget df_group_max_splits_d7fb6b4e
#> subtarget df_group_max_splits_d7b62986
#> subtarget df_group_max_splits_30627286
#> aggregate df_group_max_splits
#> dynamic df_group_max_combined_list
#> subtarget df_group_max_combined_list_ecff515c
#> aggregate df_group_max_combined_list
#> target df_overall_max
readd(df_group_max_combined_list)
#> [[1]]
#> # A tibble: 10 x 3
#> g x max_x
#> <chr> <dbl> <dbl>
#> 1 a 0.567 0.629
#> 2 a 0.563 0.629
#> 3 a 0.629 0.629
#> 4 a 0.254 0.629
#> 5 b 0.653 0.846
#> 6 b 0.846 0.846
#> 7 b 0.282 0.846
#> 8 c 0.429 0.971
#> 9 c 0.971 0.971
#> 10 c 0.433 0.971
readd(df_overall_max)
#> # A tibble: 10 x 3
#> g x max_x
#> <chr> <dbl> <dbl>
#> 1 a 0.567 0.971
#> 2 a 0.563 0.971
#> 3 a 0.629 0.971
#> 4 a 0.254 0.971
#> 5 b 0.653 0.971
#> 6 b 0.846 0.971
#> 7 b 0.282 0.971
#> 8 c 0.429 0.971
#> 9 c 0.971 0.971
#> 10 c 0.433 0.971
Created on 2019-11-13 by the reprex package (v0.3.0)
Maybe the manual should discuss these issues, though I am not exactly sure where that would fit in the flow of the current dynamic branching chapter. PRs to the manual are always welcome (it recently moved here).
aside: using a function called combine to perform a splitting operation is counterintuitive
I originally planned a dynamic split(), but during implementation, I noticed that split() and combine() were doing the exact same thing. Rather than keep both, I chose to stick with the more common of the two verbs.