kaizadp/TES_spatial_access_2021

f-ing drake

kaizadp opened this issue · 10 comments

@bpbond

drake doesn't like functions?
I get this error when I run the drake plan. It stops at the `fit_hsd_xx` functions, but it works if I run the function definitions before running the plan.

https://github.com/kaizadp/tes_spatial_access/blob/2e8024f14de3ca5e903c5732dbf2c6c327c6d1a1/code/2e-fticr_drake_plan.R#L339

    ▶ target fticr_hsd_totalpeaks
    x fail fticr_hsd_totalpeaks
    Error: target fticr_hsd_totalpeaks failed.
    diagnose(fticr_hsd_totalpeaks)$error$message:
    could not find function "fit_hsd_totalpeaks"
    diagnose(fticr_hsd_totalpeaks)$error$calls:
    peakcounts_core %>% filter(class == "total") %>% mutate(Suction = as.character(Suction)) %>%
        group_by(Suction, Homogenization, Moisture) %>% do(fit_hsd_totalpeaks(.))
    base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
    base::eval(quote(`_fseq`(`_lhs`)), env, env)
    base::eval(quote(`_fseq`(`_lhs`)), env, env)
    `_fseq`(`_lhs`)
    magrittr::freduce(value, `_function_list`)
    base::withVisible(function_list[k])
    function_list[k]
    dplyr::do(., fit_hsd_totalpeaks(.))
    dplyr:::do.grouped_df(., fit_hsd_totalpeaks(.))
    rlang::eval_tidy(args[[j]], mask)

Hey @kaizadp. No, drake likes functions fine. 😕 Should I clone this and try to run?

Yes, please!

Hey, so I see the immediate problem, but also a broader, more fundamental one.

The immediate problem is that, yes, you had `<-` calls in your pipeline, which, for very good but arcane technical reasons, drake doesn't like. So no, drake doesn't 'like' functions defined this way, and my answer above was misleading.
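
To illustrate, here is roughly the shape of the problem and of the fix, reusing the function and target names from the error above (everything else is hypothetical):

    library(drake)

    # Problematic shape: a `<-` definition inside drake_plan(). drake
    # doesn't treat this as defining a function that other targets can
    # call, so the downstream target fails with "could not find function".
    # plan <- drake_plan(
    #   fit_hsd_totalpeaks <- function(dat) { ... },
    #   fticr_hsd_totalpeaks = peakcounts_core %>% do(fit_hsd_totalpeaks(.))
    # )

    # Fix: define the helper in an ordinary script (e.g. code/functions.R),
    # source it before make(), and keep the plan entry a plain call.
    fit_hsd_totalpeaks <- function(dat) {
      # HSD model code lives here, outside the plan
      dat
    }

    plan <- drake_plan(
      fticr_hsd_totalpeaks = fit_hsd_totalpeaks(peakcounts_core)
    )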

But, and here we get to the bigger issue, that's because you're misusing the concept of a pipeline. Pipelines aren't intended or designed to have code dumped into them en masse like this. It quickly gets unreadable if, say, you have an 800+ line pipeline 😄 Instead, drake pipelines are intended to be the high-level (or at worst medium-level) 'build steps' for your analysis.
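
Concretely, a plan should read like a list of build steps, something like the sketch below (target names borrowed from this repo; the functions and the file path are hypothetical):

    plan <- drake::drake_plan(
      data_long_trt        = process_fticr_data(file_in("data/fticr_data.csv")),
      peakcounts_core      = count_peaks(data_long_trt),
      fticr_hsd_totalpeaks = fit_hsd_totalpeaks(peakcounts_core),
      report1              = render_fticr_report(fticr_hsd_totalpeaks)
    )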

For example, a huge amount of the current pipeline code is devoted to making plots. This shouldn't be here; it's confusing, and it distracts from one of the fundamental purposes of the pipeline, which is to provide an easy-to-see list of the steps and dependencies in your analysis. Take

    gg_fticr_domains = 
      data_long_trt %>% 
      distinct(formula) %>% 
      left_join(dplyr::select(meta, formula, class, HC, OC), by = "formula") %>% 
      gg_vankrev(aes(x = OC, y = HC, color = class))+
      scale_color_manual(values = PNWColors::pnw_palette("Sailboat"))+
      theme_kp()+
      theme(legend.position = "right")+
      NULL,

I'd strongly suggest putting the code to generate this plot elsewhere, and then just

    gg_fticr_domains = make_gg_fticr_domains_plot(data_long_trt),

This makes it super clear what's happening, and what the dependency is. Again, you don't want to have to see the plot-generating code in your pipeline. It's too much detail.
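
For instance, the wrapper could be as simple as the sketch below. It assumes ggplot2 and dplyr are attached and that `gg_vankrev()` and `theme_kp()` are the repo's own helpers; `meta` is passed in explicitly here so drake can see it as a dependency too.

    make_gg_fticr_domains_plot <- function(data_long_trt, meta) {
      # same steps as the original target, just moved out of the plan
      data_long_trt %>%
        distinct(formula) %>%
        left_join(dplyr::select(meta, formula, class, HC, OC), by = "formula") %>%
        gg_vankrev(aes(x = OC, y = HC, color = class)) +
        scale_color_manual(values = PNWColors::pnw_palette("Sailboat")) +
        theme_kp() +
        theme(legend.position = "right")
    }

If `meta` is itself a target, pass it as an argument like this rather than reaching for it as a global; drake only links targets that appear in the plan command, so the call becomes `make_gg_fticr_domains_plot(data_long_trt, meta)`.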

Note: the above assumes that gg_fticr_domains is used, further down in the pipeline, as a dependency for something else. If it's not, don't create a target for it! In that case just replace a whole group of plot-generating steps with something like

    fticr_replication_plots = make_fticr_replication_plots(...),   # plots are saved to disk by the function
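
That side-effect pattern might look roughly like this (a minimal sketch with a placeholder plot; the function and file names are hypothetical):

    make_fticr_replication_plots <- function(data_long_trt, out_dir = "images") {
      # build the figure (placeholder aesthetics)
      p <- ggplot2::ggplot(data_long_trt, ggplot2::aes(x = OC, y = HC)) +
        ggplot2::geom_point()
      # write it to disk as the side effect
      path <- file.path(out_dir, "fticr_replication.png")
      ggplot2::ggsave(path, plot = p, width = 7, height = 5)
      # return the path so the target still has a value drake can checkpoint
      path
    }

If you want drake to notice when the saved files are deleted or edited, declare them with `file_out()` in the plan command instead of (or in addition to) returning the paths.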

I'm opening a PR with some basic changes that enable the pipeline to run... at least until it hits

    target peakcounts_trt
    fail peakcounts_trt
    Error: Target `peakcounts_trt` failed. Call `diagnose(peakcounts_trt)` for details. Error message:
      Join columns must be present in data.
    ✖ Problem with `SampleAssignment`.

Hope this is useful. I'm not trying to criticize, just to help you move in the right direction. In my experience drake really is super cool for this kind of thing, and I feel terrible that you've had such a crappy experience with it.

That definitely helps, and we may need to have a detailed conversation on drake and best practices.
I created this drake pipeline to feed into rmarkdown. My fticr markdowns are usually really big, and I was trying to make them more efficient -- so here, I was plotting all the graphs in drake, and just calling the relevant graphs/tables in the rmarkdown report (see the "report1" target).
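
For reference, that pattern uses `drake::readd()`/`drake::loadd()` inside the report's chunks to pull cached targets out of the `.drake` store; a sketch with target names from this thread:

    # inside a chunk of the .Rmd report
    drake::readd(gg_fticr_domains)       # returns (and prints) the cached ggplot
    drake::loadd(fticr_hsd_totalpeaks)   # loads the cached object into the session
    knitr::kable(fticr_hsd_totalpeaks)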

OK if I work on your pipeline and open a PR?

Yes, of course

damn, that was a lot of edits. Thanks!
I understand what you did here, but it kind of seems like a roundabout way to plot graphs -- functions -> drake plan -> feed into Markdown.

We should discuss good workflow design.

From my point of view, the markdown document is a target (probably one of the last ones) in the pipeline; it depends on everything that's come before. But if you're just producing a single markdown, then yes, you could just do all the computations and analysis in the document itself; in fact, that's what I would do. Simpler and cleaner. The drake/target approach starts to get really useful when you have multiple targets (e.g. documents) that depend on an overlapping set of earlier targets.
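
For reference, the report-as-final-target pattern from the drake manual looks like the sketch below: `knitr_in()` marks the .Rmd as a tracked input (drake also scans it for `readd()`/`loadd()` calls, wiring up the dependencies), and `file_out()` tracks the rendered output. File paths here are hypothetical.

    plan <- drake::drake_plan(
      report1 = rmarkdown::render(
        knitr_in("report1.Rmd"),
        output_file = file_out("report1.html"),
        quiet = TRUE
      )
    )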

Yes, let's discuss I think.