ropensci/drake

Cannot recover data frame inside plan after dynamic combining

Closed · 2 comments

Description

After splitting a data frame with combine(.by = ...) and recombining the pieces with combine(), there is no documented way to keep working with the combined data frame from within the plan.

Reproducible example

library(drake)
library(dplyr)

compute_max_x <- function(df) {
  mutate(df, max_x = max(x))
}

df <- tibble(
  g = rep(c("a", "b", "c"), length.out = 10),
  x = runif(10)
)

plan <- drake_plan(
  group = df$g,

  df_group_max_splits = target(
    compute_max_x(df),
    dynamic = combine(df, .by = group)
  ),

  df_group_max_combined_list = target(
    bind_rows(df_group_max_splits),
    dynamic = combine(df_group_max_splits)
  ),

  df_overall_max = compute_max_x(df_group_max_combined_list[[1]])
)

make(plan)

Expected result

The data frame is split according to the grouping variable, and the calculation is performed correctly on each split (aside: using a function called combine() to perform a splitting operation is counterintuitive):

> readd(df_group_max_splits)
[[1]]
# A tibble: 4 x 3
  g         x max_x
  <chr> <dbl> <dbl>
1 a     0.451 0.727
2 a     0.267 0.727
3 a     0.727 0.727
4 a     0.384 0.727

[[2]]
# A tibble: 3 x 3
  g         x max_x
  <chr> <dbl> <dbl>
1 b     0.749 0.967
2 b     0.548 0.967
3 b     0.967 0.967

[[3]]
# A tibble: 3 x 3
  g         x max_x
  <chr> <dbl> <dbl>
1 c     0.392 0.412
2 c     0.101 0.412
3 c     0.412 0.412

Calling combine() with no .by argument combines the results as per the docs (aside 2: as I mentioned in another comment, one expects that calling bind_rows() on a list of data frames would return a data frame, as it would outside a drake plan):

> readd(df_group_max_combined_list)
[[1]]
# A tibble: 10 x 3
   g         x max_x
   <chr> <dbl> <dbl>
 1 a     0.451 0.727
 2 a     0.267 0.727
 3 a     0.727 0.727
 4 a     0.384 0.727
 5 b     0.749 0.967
 6 b     0.548 0.967
 7 b     0.967 0.967
 8 c     0.392 0.412
 9 c     0.101 0.412
10 c     0.412 0.412

The df_overall_max target should extract the data frame inside the single-item list and call the function on the entire combined data frame:

> readd(df_overall_max)
# A tibble: 10 x 3
   g         x max_x
   <chr> <dbl> <dbl>
 1 a     0.451 0.967
 2 a     0.267 0.967
 3 a     0.727 0.967
 4 a     0.384 0.967
 5 b     0.749 0.967
 6 b     0.548 0.967
 7 b     0.967 0.967
 8 c     0.392 0.967
 9 c     0.101 0.967
10 c     0.412 0.967

Actual result

fail df_overall_max
Error: Target `df_overall_max` failed. Call `diagnose(df_overall_max)` for details. Error message:
  no applicable method for 'mutate_' applied to an object of class "character"

If I add df2_class = class(df_group_max_combined_list) as a target and read it back with readd(), I get "drake_dynamic" instead of "list".
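One possible reading of that error, as a minimal base R sketch and not drake itself: assume a static command sees the dynamic target as a short character vector tagged with class "drake_dynamic". The class name comes from the output above; the vector contents below are a made-up placeholder. Indexing with [[1]] then yields a bare string rather than a data frame, which would be consistent with mutate() complaining about class "character".

```r
# Hypothetical stand-in for what a static command receives: a character
# vector carrying the class reported above. The string is a placeholder.
fake_dynamic <- structure("placeholder_subtarget_id", class = "drake_dynamic")

print(class(fake_dynamic))

# `[[` on a classed atomic vector drops the class attribute, so the
# value passed on to compute_max_x() would be a plain character string.
first_element <- fake_dynamic[[1]]
print(class(first_element))
print(is.data.frame(first_element))
```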

I've read through the dynamic branching chapter in the book and can't find a way to continue working with a target after the combine(.by = ...) then combine() sequence. The examples end by calling readd() on the target containing the single-item list, so I'm not sure whether this is supported.

On the other hand, I might very well be misunderstanding how combine is supposed to work, in which case this wouldn't be a bug but rather a need to expand the docs.

(Still, I'm very excited about the dynamic branching feature and hope to make greater use of it soon!)

Session info

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] drake_7.7.0.9002 dplyr_0.8.1     

loaded via a namespace (and not attached):
 [1] igraph_1.2.4.1   Rcpp_1.0.1       magrittr_1.5     tidyselect_0.2.5 R6_2.4.0         rlang_0.3.4      fansi_0.4.0      storr_1.2.1      tools_3.6.0      utf8_1.1.4       cli_1.1.0        base64url_1.4    assertthat_0.2.1 digest_0.6.19   
[15] tibble_2.1.3     crayon_1.3.4     txtq_0.1.4       purrr_0.2.4      vctrs_0.1.0      zeallot_0.1.0    glue_1.3.1       compiler_3.6.0   pillar_1.4.1     filelock_1.0.2   backports_1.1.4  renv_0.8.2       pkgconfig_2.0.2 

So glad to see these dynamic branching issues coming in so soon.

Things work most smoothly if every target downstream of a dynamic target is also dynamic. Another map() call is useful, even if there is only one sub-target.

library(drake)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

compute_max_x <- function(df) {
  mutate(df, max_x = max(x))
}

df <- tibble(
  g = rep(c("a", "b", "c"), length.out = 10),
  x = runif(10)
)

plan <- drake_plan(
  group = df$g,
  
  df_group_max_splits = target(
    compute_max_x(df),
    dynamic = combine(df, .by = group)
  ),
  
  df_group_max_combined_list = target(
    bind_rows(df_group_max_splits),
    dynamic = combine(df_group_max_splits)
  ),
  
  df_overall_max = target(
    compute_max_x(df_group_max_combined_list),
    dynamic = map(df_group_max_combined_list)
  )
)

make(plan)
#> target group
#> dynamic df_group_max_splits
#> subtarget df_group_max_splits_5319b5d3
#> subtarget df_group_max_splits_7f723e65
#> subtarget df_group_max_splits_2e487914
#> aggregate df_group_max_splits
#> dynamic df_group_max_combined_list
#> subtarget df_group_max_combined_list_a98b3360
#> aggregate df_group_max_combined_list
#> dynamic df_overall_max
#> subtarget df_overall_max_5077f7a9
#> aggregate df_overall_max

readd(df_group_max_combined_list)
#> [[1]]
#> # A tibble: 10 x 3
#>    g          x max_x
#>    <chr>  <dbl> <dbl>
#>  1 a     0.617  0.617
#>  2 a     0.0521 0.617
#>  3 a     0.381  0.617
#>  4 a     0.566  0.617
#>  5 b     0.382  0.725
#>  6 b     0.399  0.725
#>  7 b     0.725  0.725
#>  8 c     0.600  0.600
#>  9 c     0.114  0.600
#> 10 c     0.375  0.600

readd(df_overall_max)
#> [[1]]
#> # A tibble: 10 x 3
#>    g          x max_x
#>    <chr>  <dbl> <dbl>
#>  1 a     0.617  0.725
#>  2 a     0.0521 0.725
#>  3 a     0.381  0.725
#>  4 a     0.566  0.725
#>  5 b     0.382  0.725
#>  6 b     0.399  0.725
#>  7 b     0.725  0.725
#>  8 c     0.600  0.725
#>  9 c     0.114  0.725
#> 10 c     0.375  0.725

Created on 2019-11-13 by the reprex package (v0.3.0)

Alternatively, readd() pulls dynamic sub-targets back into the land of static branching. The new subtargets argument is useful if you do not want to load them all.

library(drake)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

compute_max_x <- function(df) {
  mutate(df, max_x = max(x))
}

df <- tibble(
  g = rep(c("a", "b", "c"), length.out = 10),
  x = runif(10)
)

plan <- drake_plan(
  group = df$g,
  
  df_group_max_splits = target(
    compute_max_x(df),
    dynamic = combine(df, .by = group)
  ),
  
  df_group_max_combined_list = target(
    bind_rows(df_group_max_splits),
    dynamic = combine(df_group_max_splits)
  ),
  
  df_overall_max = compute_max_x(readd(df_group_max_combined_list)[[1]])
)

make(plan)
#> target group
#> dynamic df_group_max_splits
#> subtarget df_group_max_splits_d7fb6b4e
#> subtarget df_group_max_splits_d7b62986
#> subtarget df_group_max_splits_30627286
#> aggregate df_group_max_splits
#> dynamic df_group_max_combined_list
#> subtarget df_group_max_combined_list_ecff515c
#> aggregate df_group_max_combined_list
#> target df_overall_max

readd(df_group_max_combined_list)
#> [[1]]
#> # A tibble: 10 x 3
#>    g         x max_x
#>    <chr> <dbl> <dbl>
#>  1 a     0.567 0.629
#>  2 a     0.563 0.629
#>  3 a     0.629 0.629
#>  4 a     0.254 0.629
#>  5 b     0.653 0.846
#>  6 b     0.846 0.846
#>  7 b     0.282 0.846
#>  8 c     0.429 0.971
#>  9 c     0.971 0.971
#> 10 c     0.433 0.971

readd(df_overall_max)
#> # A tibble: 10 x 3
#>    g         x max_x
#>    <chr> <dbl> <dbl>
#>  1 a     0.567 0.971
#>  2 a     0.563 0.971
#>  3 a     0.629 0.971
#>  4 a     0.254 0.971
#>  5 b     0.653 0.971
#>  6 b     0.846 0.971
#>  7 b     0.282 0.971
#>  8 c     0.429 0.971
#>  9 c     0.971 0.971
#> 10 c     0.433 0.971

Created on 2019-11-13 by the reprex package (v0.3.0)
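As a rough sketch of the subtargets argument mentioned above: the plan below is a small hypothetical one (targets x and y are not from this issue), built in a throwaway cache so it does not touch an existing .drake folder. It assumes a drake version in which readd() accepts subtargets as a vector of sub-target indices.

```r
library(drake)

# Hypothetical plan: one static vector and one dynamic target mapped over it.
plan <- drake_plan(
  x = seq_len(4),
  y = target(x * 10, dynamic = map(x))
)

# Build in a temporary cache so the example is self-contained.
cache <- new_cache(path = tempfile())
make(plan, cache = cache)

# Load only the first and third sub-targets of y instead of all four.
selected <- readd(y, cache = cache, subtargets = c(1, 3))
print(selected)
```

Depending on the drake version, the result may come back as a list of sub-target values or as a single concatenated vector.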

Maybe the manual should discuss these issues, though I am not sure where they fit in the flow of the current dynamic branching chapter. PRs to the manual are always welcome (it recently moved here).

"aside: using a function called combine to perform a splitting operation is counterintuitive"

I originally planned a dynamic split(), but during implementation, I noticed that split() and combine() were doing the exact same thing. Rather than keep both, I chose to stick with the more common of the two verbs.
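To illustrate that overlap with plain R (an analogy only, using base split() and a made-up data frame, not drake's internals): grouping rows that share a key looks like a split from the point of view of the whole data frame and like a combine from the point of view of the individual rows.

```r
# Made-up data with the same grouping pattern as the example in this issue.
df <- data.frame(
  g = rep(c("a", "b", "c"), length.out = 10),
  x = seq_len(10)
)

# "Split": one piece per group. Equivalently, the rows that share a
# value of g are combined into the same piece.
pieces <- split(df, df$g)

# Per-group computation, analogous to compute_max_x() on each sub-target.
with_max <- lapply(pieces, function(piece) transform(piece, max_x = max(x)))

# Recombine the pieces into a single data frame.
recombined <- do.call(rbind, with_max)
print(recombined)
```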