ropensci-books/drake

Confusion about large plans section: first introduce the general case?

japhir opened this issue · 5 comments

Hi there! Thanks for the great package, I'm just discovering it.

Whilst reading the manual, I'm trying to find something on very long data processing pipelines and how to create very long plans. This brings me to the obvious section on large plans. However, this documents how to create a plan with a lot of duplication using helper functions and smart syntax, not how to create a… large plan. I think I've figured out that I could create several plans and combine them using bind_plans(). If that's correct, I think this section of the manual could start with a short paragraph explaining that very long plans can make it hard to maintain an overview, and that several plans can be created with drake_plan() and then combined with bind_plans().
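For example, something like the following sketch, where import_data(), munge(), fit_model(), and summarize_fit() are hypothetical stand-ins for real processing steps:

library(drake)

# Hypothetical first half of a long pipeline.
plan_import <- drake_plan(
  raw = import_data(file_in("data.csv")),
  clean = munge(raw)
)

# Hypothetical second half.
plan_analysis <- drake_plan(
  fit = fit_model(clean),
  results = summarize_fit(fit)
)

# Combine the pieces into one plan.
plan <- bind_plans(plan_import, plan_analysis)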

I hope I've understood correctly, and that you agree that this would be an improvement to the manual!

Kind regards

However, this documents how to create a plan with a lot of duplication using helper functions and smart syntax, not how to create a… large plan.

What do you think of as a large plan? Would you post a sketch of an example? I am having trouble seeing what you mean by "large".

I think of the plans in the manual as large because they have a lot of targets. Those targets are not exact duplicates, but they do have patterns and commonalities that allow us to express them with minimal code.

I think I've figured out that I could create several plans and combine them using bind_plans(). If that's correct, I think this section of the manual could start with a short paragraph explaining that very long plans can make it hard to maintain an overview, and that several plans can be created with drake_plan() and then combined with bind_plans().

bind_plans() is acceptable, but usually suboptimal. If we want to chain large batches of targets together, we suddenly have trouble referring to upstream targets.

The following plan makes extensive use of transformations, and it scales well as we increase the number of datasets we want to analyze.

library(drake)
library(rlang)
datasets <- syms(c("gapminder", "diamonds")) # Add more datasets here.
plan <- drake_plan(
  data = target(
    munge(raw),
    transform = map(raw = !!datasets)
  ),
  analysis = target(
    analyze(data),
    transform = map(data, .id = raw)
  ),
  summary = target(
    summarize(analysis),
    transform = map(analysis, .id = raw)
  ),
  results = target(
    bind_rows(summary),
    transform = combine(summary)
  )
)

plan
#> # A tibble: 7 x 2
#>   target             command                                       
#>   <chr>              <expr>                                        
#> 1 data_gapminder     munge(gapminder)                              
#> 2 data_diamonds      munge(diamonds)                               
#> 3 analysis_gapminder analyze(data_gapminder)                       
#> 4 analysis_diamonds  analyze(data_diamonds)                        
#> 5 summary_gapminder  summarize(analysis_gapminder)                 
#> 6 summary_diamonds   summarize(analysis_diamonds)                  
#> 7 results            bind_rows(summary_gapminder, summary_diamonds)

drake_plan_source(plan)
#> drake_plan(
#>   data_gapminder = munge(gapminder),
#>   data_diamonds = munge(diamonds),
#>   analysis_gapminder = analyze(data_gapminder),
#>   analysis_diamonds = analyze(data_diamonds),
#>   summary_gapminder = summarize(analysis_gapminder),
#>   summary_diamonds = summarize(analysis_diamonds),
#>   results = bind_rows(summary_gapminder, summary_diamonds)
#> )

config <- drake_config(plan)
vis_drake_graph(config)

Created on 2019-10-21 by the reprex package (v0.3.0)

But if we try to do this with separate plans, it becomes more challenging because we have to recreate old grouping variables for downstream plans (such as raw and analyses below). Possible, but inconvenient.

library(drake)
library(rlang)
datasets <- syms(c("gapminder", "diamonds"))
plan1 <- drake_plan(
  data = target(
    munge(raw),
    transform = map(raw = !!datasets)
  ),
  analysis = target(
    analyze(data),
    transform = map(data, .id = raw)
  )
)

# It is annoying to recreate the list of analyses.
analyses <- syms(c("analysis_gapminder", "analysis_diamonds"))

plan2 <- drake_plan(
  summary = target(
    summarize(analysis),
    transform = map(analysis = !!analyses, raw = !!datasets, .id = raw)
  ),
  results = target(
    bind_rows(summary),
    transform = combine(summary)
  )
)

plan <- bind_plans(plan1, plan2)

drake_plan_source(plan)
#> drake_plan(
#>   data_gapminder = munge(gapminder),
#>   data_diamonds = munge(diamonds),
#>   analysis_gapminder = analyze(data_gapminder),
#>   analysis_diamonds = analyze(data_diamonds),
#>   summary_gapminder = summarize(analysis_gapminder),
#>   summary_diamonds = summarize(analysis_diamonds),
#>   results = bind_rows(summary_gapminder, summary_diamonds)
#> )

config <- drake_config(plan)
vis_drake_graph(config)

Created on 2019-10-21 by the reprex package (v0.3.0)

I'm sorry for not making myself more clear. Thanks for the detailed and prompt reply! Judging from this and your previous interactions here, you're an inspiration for how to engage with an online user/developer community!

Here's my work-in-progress plan, which I consider large. It has many processing steps that I have mostly captured in tiny functions, and I like to save intermediate results because some steps take a long time and I frequently change the underlying functions of the pipeline.

plan <- drake_plan(
  # rds file from other script
  didinfo = read_rds(file_in("~/SurfDrive/PhD/programming/dataprocessing/didinfo.rds")),
  wr_deep_runs = find_runs(didinfo),
  wr_deep_incl_bracketing_runs = add_bracketing_runs(wr_deep_runs),
  ilja_uu = didinfo %>%
    filter(Preparation %in% wr_deep_incl_bracketing_runs,
           !(str_detect(Comment, c("^Rog$|^C$|^CF$|^RH$|^IW$|^IAM$|^AvdM$|^JM$|^Xiaolong$|^Joris$|^NdW$|^NvdP$|^XH$|^MZ$|^RvdP$|^DR$|^Ndw$|^Kasper$")) & Comment == "other")) %>%
    add_session() %>%
    add_bg_group(bg_dates = as.POSIXct(bgd_mod$date, tz = "CEST")),
  ## bgd_events = read_rds(file_in("../../programming/dataprocessing/out/bgd_events.rds")),
  bgd_mod = read_rds(file_in("../../programming/dataprocessing/out/bg_mod.rds")),
  ## bg_res_unnest = read_rds(file_in("../../programming/dataprocessing/out/bg_res_unnest.rds")),
  bgd_duration = scan_gaps(bgd_mod),
  bgd_mods = add_bg_info(ilja_uu, bgd_mod),
  bg_corrected = ilja_uu %>%
    iso_get_raw_data() %>%
    left_join(bgd_mods, by = "file_id") %>%
    correct_backgrounds_scn(),
  cycles_disabled = bg_corrected %>%
    disable_cycles(fac = 3, relative_to = "init", genplot = FALSE),
  pl_cyc = plot_disabled_cycles(cycles_disabled),
  raw_delta_values = cycles_disabled %>%
    spread_match() %>%
    filter(cycle != 0) %>%
    append_ref_deltas(.did = ilja_uu) %>%
    delta_values(genplot = F) %>%
    collapse_cycles() %>%
    add_info(iso_get_file_info(ilja_uu)) %>%
    unnest(cols = cycle_data),
  outliers_detected = raw_delta_values %>%
    remove_outliers(genplot = FALSE, session = Session) %>%
    remove_custom_outliers(),
  pl_out = outliers_detected %>%
    filter(!outlier, !is.na(outlier), !outlier_cycle) %>%
    plot_base(x = file_datetime, y = D47_raw, shape = outlier, colour = broadid) +
    stat_summary(geom = "pointrange", fun.data = mean_cl_normal),
  raw_collapsed = outliers_detected %>%
    collapse_cycles(d13C_PDB, d18O_PDB, D45_raw, D46_raw, D47_raw, D48_raw, D49_raw, param_49,
      id = c(Analysis, file_id, file_datetime, Preparation, Line, Sample, broadid, Comment, outlier)
    ) %>%
    select(-cycle_data),
  etf_calcd = outliers_detected %>%
    empirical_transfer_function(session = Session, genplot = F) %>%
    select_if(~ !is.list(.x)),
  etf_deltad = etf_calcd %>%
    delta_etfs() %>%
    select_if(~ !is.list(.x)),
  temperatures_calculated = etf_deltad %>%
    acid_fractionation %>%
    temperature_calculation() %>%
    add_timeofday() %>%
    create_label(),
  uu_depth_info = read_depth_info(file_in("wip19_uu_excel.rds")),
  badlabs = find_bad_labels(temperatures_calculated, info = uu_depth_info),
  uu_with_depth = temperatures_calculated %>%
    left_join(uu_depth_info, by = c("label" = "id")) %>%
    filter(!is.na(label)) %>%
    add_species() %>%
    vital_effects() %>%
    fluid_d18O(),
  uu_stable_isotopes = read_rds(file_in("wip19_uu_tidy-ish.rds")) %>%
    clean_uu_isotopes(),
  hod19 = read_rds(file_in("hodel_2019_unpublished.rds")),
  url_bel = "https://www1.ncdc.noaa.gov/pub/data/paleo/contributions_by_author/bell2014/bell2014-1264.txt",
  bel14 = target(command = read_tsv(url_bel, skip = 69, guess_max = 2000),
                 trigger = trigger(change = last_url_date(url_bel))),
  bel14_clean = bel14 %>%
    select(depth_mbsf = depth_m, age = age_calkaBP, d13C = d13Cc.wuell, d18O = `d18Oc.wuell-cor64`) %>%
    mutate(d18O = d18O - 0.64, # undo diseq. correction
           spec = "Cib. wul.", ref = "Bell et al., 2014", site = 1264L, depth=depth_mbsf),
  url_mod = "https://data.giss.nasa.gov/cgi-bin/o18data/geto18.cgi?cmd=tab&id=tmp947915143",
  mod_d18O = target(command = read_tsv(url_mod, skip = 69, guess_max = 2000),
                    trigger = trigger(change = last_url_date(url_mod))),
  avg_d18O = mod_d18O %>%
    filter(Longitude > 2, Longitude < 3, Latitude > -30, Latitude < -20) %>%
    ## mutate(d18O = parse_double(d18O)) %>%
    pull(d18O) %>%
    mean_cl_normal(),
  temp_plot = uu_with_depth %>%
    collapse_cycles(d13C_PDB_etf, d13C_PDB, d18O_PDB_vit, d18O_PDB, D47_raw, D47_final, temperature, d18O_sw_SMOW,
                    id = c(Analysis, file_id, file_datetime, Preparation, label, broadid, Comment, age, outlier, spec)) %>%
    select(-cycle_data) %>%
    filter(!Analysis %in% c(7035, 7026)) %>%
    add_agebreaks(),
  binranges = summarize_bins(temp_plot),
  alldat = uu_stable_isotopes %>%
    filter(machine != "clumped") %>%
    mutate(ref_old = ref, ref = ifelse(str_detect(ref_old, "Diederik"),
                                       "Liebrand et al., 2016",
                                       "UU stable isotopes")) %>%
    bind_rows(mutate(bel14, outlier = FALSE))
)

Sorry for not running this through the reprex package (v0.3.0): I would also have to include all the functions, and I don't think that would help illustrate the point.

Right now it doesn't run entirely because there are still some refactoring errors, but the basic structure is there. There are some steps that I could still capture in separate functions, but because this recipe for analysis will be used across a couple of datasets, I would like to refactor so that the functions are all generic and the plans are adapted to each dataset's needs. Does that sound like a good strategy, or would it be easier to adapt the functions and keep the plan the same?

I didn't mean to get so specific, but I think it's easiest to explain what I mean by giving the whole example here.

So this is what I mean by a long plan. It gets confusing fast, but it works. The first rds file in the pipeline comes from a different script that requires significant computation time as well as a connection to a remote server, so for now I've kept that external.

Yeah, that is definitely a large plan. Not too large to write out by hand, but still large. And you're right, bind_plans() allows you to break up the plan into sections and combine those sections later. Even base::rbind() and dplyr::bind_rows() will work in most situations. Useful to consider, but straightforward.
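For instance, here is a minimal sketch (process() and summarize_results() are hypothetical stand-ins for your own functions):

library(drake)

plan1 <- drake_plan(processed = process(file_in("raw.csv")))
plan2 <- drake_plan(results = summarize_results(processed))

# Each of these produces the same combined plan.
plan <- bind_plans(plan1, plan2)
plan <- rbind(plan1, plan2)
plan <- dplyr::bind_rows(plan1, plan2)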

The other kind of large plan, on the other hand, requires its own interface to understand and work with properly. It has a whole section in the manual because it requires a lot more explanation and discussion. There is just more to cover.

There are some steps that I could still capture in separate functions, but because this recipe for analysis will be used across a couple of datasets, I would like to refactor so that the functions are all generic and the plans are adapted to each dataset's needs. Does that sound like a good strategy, or would it be easier to adapt the functions and keep the plan the same?

Absolutely, yes! One of the best things you can do is capture steps in functions, make those functions as reusable as possible, and turn the commands in the plan into short function calls. Those functions become your own personal shorthand that makes the plan easier to read. This is a strategy I am deliberately trying to spread.
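To illustrate with one step from your plan, a long command can shrink to a single call. Here, detect_outliers() is a hypothetical wrapper around your existing remove_outliers() and remove_custom_outliers() functions:

library(drake)
library(magrittr)

# Hypothetical wrapper around your existing outlier functions.
detect_outliers <- function(raw_delta_values) {
  raw_delta_values %>%
    remove_outliers(genplot = FALSE, session = Session) %>%
    remove_custom_outliers()
}

# The command in the plan becomes a short, readable call.
plan <- drake_plan(
  outliers_detected = detect_outliers(raw_delta_values)
)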

So this is what I mean by a long plan. It gets confusing fast, but it works.

vis_drake_graph() and the other graphing functions can help. The interactive graph is one of the most convenient ways to look at a plan like yours and understand it at a high level, and it is particularly effective for fully hand-written plans of this size.

library(drake)
plan <- drake_plan(
  # rds file from other script
  didinfo = read_rds(file_in("~/SurfDrive/PhD/programming/dataprocessing/didinfo.rds")),
  wr_deep_runs = find_runs(didinfo),
  wr_deep_incl_bracketing_runs = add_bracketing_runs(wr_deep_runs),
  ilja_uu = didinfo %>%
    filter(Preparation %in% wr_deep_incl_bracketing_runs,
           !(str_detect(Comment, c("^Rog$|^C$|^CF$|^RH$|^IW$|^IAM$|^AvdM$|^JM$|^Xiaolong$|^Joris$|^NdW$|^NvdP$|^XH$|^MZ$|^RvdP$|^DR$|^Ndw$|^Kasper$")) & Comment == "other")) %>%
    add_session() %>%
    add_bg_group(bg_dates = as.POSIXct(bgd_mod$date, tz = "CEST")),
  ## bgd_events = read_rds(file_in("../../programming/dataprocessing/out/bgd_events.rds")),
  bgd_mod = read_rds(file_in("../../programming/dataprocessing/out/bg_mod.rds")),
  ## bg_res_unnest = read_rds(file_in("../../programming/dataprocessing/out/bg_res_unnest.rds")),
  bgd_duration = scan_gaps(bgd_mod),
  bgd_mods = add_bg_info(ilja_uu, bgd_mod),
  bg_corrected = ilja_uu %>%
    iso_get_raw_data() %>%
    left_join(bgd_mods, by = "file_id") %>%
    correct_backgrounds_scn(),
  cycles_disabled = bg_corrected %>%
    disable_cycles(fac = 3, relative_to = "init", genplot = FALSE),
  pl_cyc = plot_disabled_cycles(cycles_disabled),
  raw_delta_values = cycles_disabled %>%
    spread_match() %>%
    filter(cycle != 0) %>%
    append_ref_deltas(.did = ilja_uu) %>%
    delta_values(genplot = F) %>%
    collapse_cycles() %>%
    add_info(iso_get_file_info(ilja_uu)) %>%
    unnest(cols = cycle_data),
  outliers_detected = raw_delta_values %>%
    remove_outliers(genplot = FALSE, session = Session) %>%
    remove_custom_outliers(),
  pl_out = outliers_detected %>%
    filter(!outlier, !is.na(outlier), !outlier_cycle) %>%
    plot_base(x = file_datetime, y = D47_raw, shape = outlier, colour = broadid) +
    stat_summary(geom = "pointrange", fun.data = mean_cl_normal),
  raw_collapsed = outliers_detected %>%
    collapse_cycles(d13C_PDB, d18O_PDB, D45_raw, D46_raw, D47_raw, D48_raw, D49_raw, param_49,
                    id = c(Analysis, file_id, file_datetime, Preparation, Line, Sample, broadid, Comment, outlier)
    ) %>%
    select(-cycle_data),
  etf_calcd = outliers_detected %>%
    empirical_transfer_function(session = Session, genplot = F) %>%
    select_if(~ !is.list(.x)),
  etf_deltad = etf_calcd %>%
    delta_etfs() %>%
    select_if(~ !is.list(.x)),
  temperatures_calculated = etf_deltad %>%
    acid_fractionation %>%
    temperature_calculation() %>%
    add_timeofday() %>%
    create_label(),
  uu_depth_info = read_depth_info(file_in("wip19_uu_excel.rds")),
  badlabs = find_bad_labels(temperatures_calculated, info = uu_depth_info),
  uu_with_depth = temperatures_calculated %>%
    left_join(uu_depth_info, by = c("label" = "id")) %>%
    filter(!is.na(label)) %>%
    add_species() %>%
    vital_effects() %>%
    fluid_d18O(),
  uu_stable_isotopes = read_rds(file_in("wip19_uu_tidy-ish.rds")) %>%
    clean_uu_isotopes(),
  hod19 = read_rds(file_in("hodel_2019_unpublished.rds")),
  url_bel = "https://www1.ncdc.noaa.gov/pub/data/paleo/contributions_by_author/bell2014/bell2014-1264.txt",
  bel14 = target(command = read_tsv(url_bel, skip = 69, guess_max = 2000),
                 trigger = trigger(change = last_url_date(url_bel))),
  bel14_clean = bel14 %>%
    select(depth_mbsf = depth_m, age = age_calkaBP, d13C = d13Cc.wuell, d18O = `d18Oc.wuell-cor64`) %>%
    mutate(d18O = d18O - 0.64, # undo diseq. correction
           spec = "Cib. wul.", ref = "Bell et al., 2014", site = 1264L, depth = depth_mbsf),
  url_mod = "https://data.giss.nasa.gov/cgi-bin/o18data/geto18.cgi?cmd=tab&id=tmp947915143",
  mod_d18O = target(command = read_tsv(url_mod, skip = 69, guess_max = 2000),
                    trigger = trigger(change = last_url_date(url_mod))),
  avg_d18O = mod_d18O %>%
    filter(Longitude > 2, Longitude < 3, Latitude > -30, Latitude < -20) %>%
    ## mutate(d18O = parse_double(d18O)) %>%
    pull(d18O) %>%
    mean_cl_normal(),
  temp_plot = uu_with_depth %>%
    collapse_cycles(d13C_PDB_etf, d13C_PDB, d18O_PDB_vit, d18O_PDB, D47_raw, D47_final, temperature, d18O_sw_SMOW,
                    id = c(Analysis, file_id, file_datetime, Preparation, label, broadid, Comment, age, outlier, spec)) %>%
    select(-cycle_data) %>%
    filter(!Analysis %in% c(7035, 7026)) %>%
    add_agebreaks(),
  binranges = summarize_bins(temp_plot),
  alldat = uu_stable_isotopes %>%
    filter(machine != "clumped") %>%
    mutate(ref_old = ref, ref = ifelse(str_detect(ref_old, "Diederik"),
                                       "Liebrand et al., 2016",
                                       "UU stable isotopes")) %>%
    bind_rows(mutate(bel14, outlier = FALSE))
)

config <- drake_config(plan)
vis_drake_graph(config)

Created on 2019-10-22 by the reprex package (v0.3.0)

Thanks! I'll keep refactoring to make things neat and understandable. With this strategy in mind, that'll be a fun exercise :).

The other kind of large plan, on the other hand, requires its own interface to understand and work with properly. It has a whole section in the manual because it requires a lot more explanation and discussion. There is just more to cover.

In that case, maybe the title of the chapter could be changed to something like "Creating complex plans" or something similar?

In that case, maybe the title of the chapter could be changed to something like "Creating complex plans" or something similar?

I will consider it, but I reference that section a lot, so I worry about breaking links. There will come a time to seriously refactor the whole manual, though, and changes will be necessary then anyway.