targets and tarchetypes
wlandau opened this issue Β· 71 comments
Submitting Author: Name (@wlandau)
Repository:
targets
: https://github.com/wlandau/targetstarchetypes
: https://github.com/wlandau/tarchetypes. (tarchetypes
is a small companion package that only serves to extendtargets
, so thought it appropriate to submit it for review along withtargets
.)- The manual: https://github.com/wlandau/targets-manual. (For https://github.com/ropensci-books.)
Version submitted:
- 0.0.0.9002 (
targets
) - 0.0.0.9000 (
tarchetypes
and the manual)
Editor: @maurolepore
Reviewer 1: @limnoliver
Reviewer 2: @tjmahr
Archive: TBD
Version accepted: TBD
DESCRIPTION files
targets
Package: targets
Title: Dynamic Function-Oriented Make-Like Declarative Pipelines for R
Description: The targets package is a pipeline toolkit that brings together
function-oriented programming and Make-like declarative workflows for
Statistics and data science in R. It implements a workflow as collection of
interconnected tasks, analyzes the dependency relationships among these
tasks, skips steps that are already up to date, runs the necessary
computations with optional parallel workers, abstracts files as
R objects, and provides tangible evidence that the results match
the underlying code and data. The methodology in this package
borrows from GNU Make by Richard Stallman (2015, ISBN:978-9881443519)
and drake by Will Landau (2018) <doi:10.21105/joss.00550>.
Version: 0.0.0.9001
License: MIT + file LICENSE
URL: https://wlandau.github.io/targets/, https://github.com/wlandau/targets
BugReports: https://github.com/wlandau/targets/issues
Authors@R: c(
person(
given = c("William", "Michael"),
family = "Landau",
role = c("aut", "cre"),
email = "will.landau@gmail.com",
comment = c(ORCID = "0000-0003-1878-3253")
),
person(
family = "Eli Lilly and Company",
role = "cph"
),
person(
given = c("Matthew", "T."),
family = "Warkentin",
role = "ctb"
))
Depends:
R (>= 3.5.0)
Imports:
callr (>= 3.4.3),
cli (>= 2.0.2),
codetools (>= 0.2.16),
data.table (>= 1.12.8),
digest (>= 0.6.25),
igraph (>= 1.2.5),
R6 (>= 2.4.1),
rlang (>= 0.4.5),
tibble (>= 3.0.1),
tidyselect (>= 1.1.0),
utils,
vctrs (>= 0.2.4),
withr (>= 2.1.2)
Suggests:
aws.s3 (>= 0.3.21),
clustermq (>= 0.8.9),
curl (>= 4.3),
dplyr (>= 1.0.0),
fst (>= 0.9.2),
future (>= 1.17.0),
keras (>= 2.2.5.0),
knitr (>= 1.30),
rmarkdown (>= 2.4),
qs (>= 0.23.2),
rstudioapi (>= 0.11),
testthat (>= 2.3.2),
torch (>= 0.1.0),
usethis (>= 1.6.3),
visNetwork (>= 2.0.9)
Encoding: UTF-8
Language: en-US
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1.9000
VignetteBuilder: knitr
tarchetypes
Package: tarchetypes
Title: Archetypes for Targets
Description: The targets package is a pipeline toolkit that brings together
function-oriented programming and Make-like declarative workflows for
Statistics and data science in R. The tarchetypes package provides
convenient user-side functions to create specialized targets,
making pipelines easier to create and read. The methods in this package
were influenced by the drake R package by Will Landau (2018)
<doi:10.21105/joss.00550>.
Version: 0.0.0.9000
License: MIT + file LICENSE
URL: https://wlandau.github.io/tarchetypes/, https://github.com/wlandau/tarchetypes
BugReports: https://github.com/wlandau/tarchetypes/issues
Authors@R: c(
person(
given = c("William", "Michael"),
family = "Landau",
role = c("aut", "cre"),
email = "will.landau@gmail.com",
comment = c(ORCID = "0000-0003-1878-3253")
),
person(
family = "Eli Lilly and Company",
role = "cph"
))
Depends:
R (>= 3.5.0)
Imports:
fs (>= 1.4.2),
rlang (>= 0.4.7),
targets,
tidyselect (>= 1.1.0),
utils,
vctrs (>= 0.3.4),
withr (>= 2.1.2)
Suggests:
digest (>= 0.6.25),
knitr (>= 1.28),
rmarkdown (>= 2.1),
testthat (>= 2.3.2)
Remotes:
wlandau/targets
Encoding: UTF-8
Language: en-US
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1.9000
- Manual
Package: targets.manual
Title: Targets R Package User Manual
Description: This repository contains the source files of the targets R
package user manual.
Version: 0.0.0.9000
License: MIT + file LICENSE
URL: https://wlandau.github.io/targets-manual,
https://github.com/wlandau/targets-manual
BugReports: https://github.com/wlandau/targets-manual/issues
Authors@R: c(
person(
given = c("William", "Michael"),
family = "Landau",
role = c("aut", "cre"),
email = "will.landau@gmail.com",
comment = c(ORCID = "0000-0003-1878-3253")
),
person(
family = "Eli Lilly and Company",
role = "cph"
))
Depends:
R (>= 3.5.0)
Imports:
biglm (>= 0.9.2),
bookdown (>= 0.19),
fs (>= 1.4.1),
purrr (>= 0.3.4),
tarchetypes,
targets,
tidyverse (>= 1.3.0),
visNetwork (>= 2.0.9),
withr (>= 2.2.0)
Remotes:
wlandau/tarchetypes,
wlandau/targets
Encoding: UTF-8
Language: en-US
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.0
VignetteBuilder: knitr
Scope
-
Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- data retrieval
- data extraction
- data munging
- data deposition
- workflow automation
- version control
- citation management and bibliometrics
- scientific software wrappers
- field and lab reproducibility tools
- database software bindings
- geospatial data
- text analysis
-
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
targets
is an R-focused pipeline toolkit for Make-like declarative workflows. It resolves the dependency relationships among steps of a data analysis workflow and skips steps that are already up to date.
- Who is the target audience and what are scientific applications of this package?
targets
is for R users who maintain computationally intense function-oriented data analysis projects (with large codebases and/or long runtimes). Such projects may include but are not limited to Bayesian statistics, simulation, machine learning, PK/PD, and spatial statistics.
- Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
targets
is the long-term successor to drake
. After four years of development, drake
has improved so much that its insurmountable problems have become its most pressing ones. A new package is necessary to advance the capability further. So while I still believe drake
is thriving, and even though I will continue to maintain drake
indefinitely, I created targets
to try to break new ground. At https://wlandau.github.io/targets/articles/need.html#drake, I take a detailed dive into the ways that targets
surpasses drake
's permanent limitations.
- (If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
N/A
- If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
N/A
Technical checks
Confirm each of the following by checking the box.
- I have read the guide for authors and rOpenSci packaging guide.
This package:
- does not violate the Terms of Service of any service it interacts with.
- has a CRAN and OSI accepted license.
- contains a README with instructions for installing the development version.
- includes documentation with examples for all functions, created with roxygen2.
- contains a vignette with examples of its essential functions and uses. To increase modularity and reduce package check time, all the user-side vignettes actually live at https://github.com/wlandau/targets-manual (deployed to https://wlandau.github.io/targets-manual). Repos at https://github.com/wlandau/targets-minimal, https://github.com/wlandau/targets-stan, and https://github.com/wlandau/targets-keras have the complete source code for example use cases. To avoid encumbering core
targets
and to avoid maintaining duplicated documentation, the vignettes of the actual package only include the statement of need and design documents. The README is deliberately short and links to all this existing documentation. - has a test suite. Not all of the tests in
targets
can be automated, especially when it comes to visualization and profiling, so many of the tests live in non-testthat
folders in https://github.com/wlandau/targets/tree/master/tests. Whenever I use a#nocov
block, I always include a comment with a justification and/or a reference to one of these semi-automated tests. - has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.
Publication options
- Do you intend for this package to go on CRAN?
- Do you intend for this package to go on Bioconductor?
- Do you wish to automatically submit to the Journal of Open Source Software? If so:
JOSS Options
- The package has an obvious research application according to JOSS's definition.
- The package contains a
paper.md
matching JOSS's requirements with a high-level description in the package root or ininst/
.I have written a paper.md, but I need to run it through my company's scientific disclosure process before I share it. That could take a few weeks.paper.md
andpaper.bib
now disclosed and included insideinst/
. - The package is deposited in a long-term repository with the DOI:
- (Do not submit your package separately to JOSS)
- The package contains a
- Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
MEE Options
- The package is novel and will be of interest to the broad readership of the journal.
- The manuscript describing the package is no longer than 3000 words.
- You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
- (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
- (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
- (Please do not submit your package separately to Methods in Ecology and Evolution)
Code of conduct
- I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.
Hello @wlandau for the submission. Before I address it further, could you elaborate on the need for submitting both packages on a single submission? Could this be split into two different submissions? (Thinking about potential reviewer workload). Please, be thorough.
tarchetypes
is a collection of user-level utilities for targets
, and it does not stand on its own. Its purpose is to make targets
easier to use, and it requires prior familiarity with the latter. I originally considered implementing both in a single package, and I still view their physical separation as an implementation detail. So I thought that separate submissions would potentially duplicate the overall reviewer-side overhead of learning targets
, especially compared to the relatively small size of tarchetypes
. In addition, tarchetypes
helpers like tar_render()
also appear in the manual.
But I am still happy to split up the submission if you think it best. Please let me know what you decide.
Hello @wlandau, thanks for the clarifications. @maurolepore will be your handling editor.
@wlandau, thanks for the high quality of this submission. I'm pleased to be the
editor.
Here I first give an overview of what I see, then ask for more information to
better understand if this or another structure is the best for this submission,
and finally comment on the editor checks. Please note the comments preceded by
a bullet point and respond to those preceded by a check-box; you may refer to their
labels.
In this submission I see three packages: targets
, tarchetypes
, and
targets.manual
. I understand that (a) targets
supersedes the popular package
drake
, also maintained by you; (b) tarchetyes
extends targets
; and (c)
targets.manual
would be published as an rOpenSci book containing the manual
for both targets
and tarchetypes
. Please correct me if necessary.
With three packages, this submission is more complex than the typical 1-package
submission. This may add extra pressure on resources such as rOpenSci reviewers,
and rOpenSci and CRAN servers. Please help me understand why this is the best
alternative (if possible, point me to the source of information that guided your
decision, e.g. books, articles, conversations with other developers):
-
(ml01) How did you decide to supersede
drake
instead of extend it? Did
you considered any other option, such as the "edition" approach of testthat 3? -
(ml02) How did you decide that
targets
andtarchetypes
should be two
separate packages? -
(ml03) Who would you suggest as a reviewer for this submission? Please
consider people with skills that complement yours, specially if they have
experience restructuring popular R packages.
Your answers will clarify the fit -- which I leave unchecked.
targets
Editor checks:
- Fit: The package meets criteria for fit and overlap
- Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
- License: The package has a CRAN or OSI accepted license
- Repository: The repository link resolves correctly
- Archive (JOSS only, may be post-review): The repository DOI resolves correctly
- Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?
Editor comments
README
-
(ml04) Under What about
drake? I see "It [is] has
become ...". You may remove "is". -
(ml05) Under Help, where it
says "Post to the [GitHub issue tracker]", the link seems incorrect:
https://github.com/wlandau/issues.
- (ml06) In general, I see an opportunity to remove duplication and thus make
the package easier to maintain. For example, the file "README.Rmd" links to
multiple pages under the same website; a single link to the website might be
enough and easier to maintain. Similarly, under
Help I see links to repositories
other thantargets
. This may be hard to maintain. In isolation, this may seem
like a minor issue, but minor issues can compound quickly.
CI
- (ml07) In the file "DESCRIPTION" I see
Depends: R (>= 3.5.0)
, but the
file ".github/workflows/check.yaml" covers R-release only. Please extend the
continuous-integration workflow to run R CMD check from R-devel to the R
version stated in the file "DESCRIPTION"
(?usethis::use_github_action_check_full()
).
tarchetypes
Editor checks:
- Fit: The package meets criteria for fit and overlap
- Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
- License: The package has a CRAN or OSI accepted license
- Repository: The repository link resolves correctly
- Archive (JOSS only, may be post-review): The repository DOI resolves correctly
- Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?
Editor comments
- (ml08) Same as ml06.
- (ml09) Same as ml07.
targets.manual
Editor checks:
- Fit: The package meets criteria for fit and overlap
- Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
- License: The package has a CRAN or OSI accepted license
- Repository: The repository link resolves correctly
- Archive (JOSS only, may be post-review): The repository DOI resolves correctly
- Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?
Editor comments
-
(ml10) The name of the package and GitHub repository don't match: "targets.manual" (with ".") and "target-manual" (with "-"). Because the symbol "-" is invalid in the name of an R package, consider renaming the repository to "targets.manual". The inconsistency might confuse some users.
-
(ml11)
devtools::install_github()
failed withupgrade = "default" and passed with
upgrade = "never"`. The error seems unrelated to the package but please check.
devtools::install_github("wlandau/targets-manual", dependencies = TRUE,
build_vignettes = TRUE)
#> Using github PAT from envvar GITHUB_PAT
#> Downloading GitHub repo wlandau/targets-manual@HEAD
#> readr (1.3.1 -> 1.4.0) [CRAN]
#> Installing 1 packages: readr
#> Installing package into '/home/mauro/R/x86_64-pc-linux-gnu-library/4.0'
#> (as 'lib' is unspecified)
#> Error: Failed to install 'targets.manual' from GitHub:
#> (converted from warning) installation of package 'readr' had non-zero exit status
- (ml12) R CMD check found 1 ERROR. Please fix the error or explain if R CMD check is not applicable to this type of package.
#> > checking package dependencies ... ERROR
#> VignetteBuilder package not declared: βknitrβ
- (ml13) Tests seem not applicable to the book manual. I see no test in other
books, e.g. https://github.com/ropensci-books/http-testing and
https://github.com/ropensci-books/taxize.
-
(ml14) I see a mismatch between the copy right holder in DESCRIPTION and
LICENSE (but the copy right holder in DESCRIPTION and LICENSE.md do match). Please ensure they match. -
(ml15) Please update the words list.
devtools::spell_check()
#> WORD FOUND IN
#> βs index.Rmd:55
#> indivdual index.Rmd:91
#> learnings index.Rmd:55
#> repo README.md:5
#> toolkits index.Rmd:45,47,51
#> zenodo README.md:3
Reviewers:
Reviewer: @limnoliver
Reviewer: @tjmahr
Due date: 2020-12-19*
*
This is four weeks from the day a second reviewer confirmed. I discussed with the chief editor @melvidoni to allow one week more than usual, because of COVID and because this submission includes 2 packages.
Thank you for your insightful feedback, @maurolepore. Below, I address ml04-ml15. (I will respond to ml01-ml03 in a separate post.)
- ml04: odd, I thought I removed that grammatical typo earlier. Fixed in ropensci/targets@454ddf5.
- ml05: link fixed in ropensci/targets@a3491b1.
- ml06: I agree, and avoiding duplication is one of my major documentation goals for
targets
. Addressed in ropensci/targets@a3491b1. - ml07: I added new CI checks under R 3.5 and 3.6 in ropensci/targets@b1a3261
- ml08: I deduplicated some links in ropensci/tarchetypes@d2915e9. I prefer to keep the existing links to individual package functions.
- ml09: addressed in ropensci/tarchetypes@c2e94ea.
- ml10-ml13: I intend targets-manual to just be a
bookdown
manual, not an installable package. The only purpose of theDESCRIPTION
was to make CI easier. To reduce confusion, in ropensci-books/targets@a2092f9 and ropensci-books/targets@5a69d0c, I removed theDESCRIPTION
and moved the relevant content intoREADME.md
andpackages.R
. - ml14: also fixed in ropensci-books/targets@a2092f9 and ropensci-books/targets@5a69d0c.
DESCRIPTION
andLICENSE
are gone, andREADME.md
is consistent withLICENSE.md
. - ml15: without a
DESCRIPTION
,devtools::spell_check()
no longer works. However, I did fix a spelling issue in ropensci-books/targets@7ed0388 ("indivdual" => "individual").
Thanks @wlandau for addressing my comments.
- ml10-13 and ml15:
The only purpose of the DESCRIPTION was to make CI easier. To reduce confusion, in ropensci-books/targets@a2092f9 and ropensci-books/targets@5a69d0c, I removed the DESCRIPTION and moved the relevant content into README.md and packages.R.
- (ml16) I support this (more on ml18 below). Now that I understand it, I prefer your original take -- it aligns with the idea of "research compendium". Please revert to it, then explain the goal, structure, and usage of the repository in the field "Description" of the file DESCRIPTION.
- (ml17) Even if
targets.manual
is not meant to be built and installed as a package, consider adding the minimum infrastructure to the repository fordevtools::check()
to pass cleanly -- that seems like the path of least resistance. If this is a bad idea, maybe write a broken test with the error message you want someone like me to read.
--
- (ml18) I support anything that makes your work less error prone and easier to maintain. My comments ml01-ml03 focused on external resources, but I forgot to stress that my main concern is maintainability. Your work is outstanding; the R community will benefit from you investing time in the important things instead of mundane tasks. If those mundane tasks are avoidable (e.g. with continuous integration, an "edition" approach, or rethinking the architecture of you system), then my goal is to ensure you consider those options. Collectively, the rOpenSci community has a lot of experience that should help you make informed, good choices.
- ml18: Thanks, @maurolepore. This helps me answer your questions. The way I am structuring things now, including the separation into multiple repositories, is largely to support maintainability. I will elaborate in my subsequent posts.
Below, I answer ml02 and ml03, and I address the physical separation of targets-manual
and other documentation. ml01 requires a much deeper answer, so I will address it in its own post.
ml02
I have had more time to reflect since @melvidoni first brought this up, and I will try to elaborate.
I could have easily implemented targets
and tarchetypes
in the same package. Both are essentially the same system, and tarchetypes
is currently very small, supporting only a limited handful of superficial extensions to targets
' deeper capabilities. So for the present, the physical separation is a trivial implementation detail, and I do not believe it will exacerbate the burden of software review.
The separation of tarchetypes
is about planning for the future. From what I learned maintaining drake
and observing use cases, tarchetypes
will enhance maintainability, the quality of the infrastructure and user-side freedom for years to come.
Sustainable infrastructure
Interface development incurs additional challenges, code volume, bugs, tests, and documentation. I learned this the hard way while developing static branching in drake
. Because drake
's design did not allow me to implement static branching in a separate package, drake
itself became more difficult to maintain, more prone to feature creep, and more prone to errors (see ropensci/drake#1199, ropensci/drake#1262, ropensci/drake#1010, ropensci/drake#1009, ropensci/drake#1008, and many more issues like these). And although I fixed all the known and reported bugs, the deeper underlying causes only worsened as the years went by, and the fundamental design of the internals and the interface made it impossible to eliminate these problems in drake
itself.
As its own package, tarchetypes
has plenty of room to explore alternative interfaces and shorthand for targets and pipelines. That way,targets
itself will stay light, elegant, and sharply delineated in scope, and the architecture will remain clean, resilient, and sustainable in the long run.
A precedent for extensibility
In addition, tarchetypes
deliberately sets a precedent for enhanced automation through third-party interface development. tarchetypes
functions like tar_render()
and tar_map()
are easy to implement using targets::tar_target_raw()
, and even now, developers are already borrowing the pattern to develop their own specialized interfaces. Internally at work, for example, one of my colleagues maintains a package for domain-specific simulation studies in the life sciences, and it has custom target archetypes tailored to the specific needs of these internal pipelines. An earlier version of this same package used drake
, and the metaprogramming on top of drake_plan()
was messy and nearly impossible.
targets-manual
In the early days of drake
, all the documentation lived in the README and vignettes. Over time, the vignettes steadily accrued volume, length, and R package dependencies, and drake
became difficult to install. So I moved the vignettes to their own bookdown
project, and the problem was solved.
With targets-manual
, I am trying to do the same thing right from the start, even though it is much shorter than drake
's manual at the moment. So targets-manual
is just a collection of user-side vignettes that happens to be in its own GitHub repository. As with tarchetypes
, I do not believe the physical separation of targets
and targets-manual
increases the burden of software review relative to unifying them in a single package. Conceptually, they still belong together.
Other documentation
targets
has additional repositories with extra documentation. They help people learn the package, but I am not submitting them to rOpenSci.
- https://github.com/wlandau/targets-minimal
- https://github.com/wlandau/targets-stan
- https://github.com/wlandau/targets-keras
- https://github.com/wlandau/targets-tutorial
(1)-(3) are example projects that use targets
, and (4) is a half-day short course. Again, I find the physical separation to be important. For drake
, I made the mistake of curating multiple example projects in https://github.com/wlandau/drake-examples, and the repo grew without bound. With different projects in different repos now, everything is easier to maintain. The code is easier to fix, and it is easier to set up RStudio Cloud workspaces to help people learn. In addition, the pattern demonstrates to users that each targets
-powered data analysis workflow should live in its own repo, which is exactly how I have observed most successful projects to be implemented in practice.
ml03
My top reviewer recommendations are power users who have already provided extensive feedback and expressed strong enthusiasm about targets
. Their prior familiarity could reduce the workload.
Otherwise, it may help to reach out to folks who have spread drake
throughout the community. Here are some names that come to mind.
- @krlmlr
- @thebioengineer
- @aedobbyn
- @matt-dray
- @matthewstrasiotto
- @tjmahr
- @sinarueeger
- @cstawitz
- @bpbond
- @gadenbuie
- @billdenney
- @tiernanmartin
- @b-rodrigues
- @DominikRafacz
- @pat-s
- @maelle (editor of
drake
's software review: #156) - @karthik (though I am not sure if co-founders also review packages)
Happy to be considered as a possible reviewer of targets
and tarchetypes
. I use both packages on a near-daily basis for many on-going projects, including projects that are developed and built as purely local, purely remote (live and run on HPC) and a mixture of local and remote (live local and selectively run locally or on HPC via SSH).
Thank you both for your eagerness to help!
@maurolepore, I addressed ml16 and ml17 in commits ropensci-books/targets@f792c14 through ropensci-books/targets@2ee3725.
Do you still think it is necessary to move https://github.com/wlandau/targets-manual to https://github.com/wlandau/targets.manual? If the manual is accepted into rOpenSci, I expect the hyphen to go away on its own when the URLs to move to https://github.com/ropensci-books/targets and https://books.ropensci.org/targets.
Other than that, I believe only ml01 remains for now.
I'm deeply appreciative to be considered to help out with targets
.
I've been following it's development with interest as a daily drake user on remote systems/hpc environments with an interest in usability and discoverability in data pipelines.
I've been hesitant to migrate my workflow over. From your description, it'll be valuable to do so and I look forward to trying it out.
I'll also be making an effort to update my extension package for drake, mandrake to support targets
.
While I'm very interested in targets
, I unfortunately can't help as a reviewer in the next month or two.
Almost forgot: @mattwarkentin is actually a formal contributor because of ropensci/targets#170, and @noamross has a PR at ropensci/tarchetypes#9 that I meant to keep open. @melvidoni and @maurolepore, does this mean they might have to recuse themselves?
Now, I will address ml01. I think it is the most pressing question from this thread so far, and it is definitely the most difficult to explain in complete depth. So please let me know if you are skeptical or if anything remains unclear or unresolved.
What all this means for drake
drake
is in a position of strength, and this is exactly what allowed me to create targets
in the first place. After four years of steady development, I feel that we have solved most of the solvable problems worth mentioning, which finally forced me to reckon with the unsolvable ones. targets
is not about remediation, it is about breaking new ground.
Maintenance
I will maintain drake
indefinitely. I will continue to provide one-on-one help with use cases, fix known bugs, address known inefficiencies, and consider new community-requested features. However, since I am not going to propose new feature ideas of my own or strike out on Odyssean refactoring adventures, maintenance will be far easier and less time-consuming. In fact, because targets
has a much cleaner design, the combined maintenance of drake
, targets
, tarchetypes
, and the docs could prove far less demanding than that of drake
alone up to this point.
Why not an edition?
I have been hearing quite a lot about testthat
editions lately. I read the vignette, and I just updated targets
and tarchetypes
so their test suites are compatible with both editions 2 and 3. So although I did not actually inspect the changes to the testthat
code base, I think I get the idea. From the last section of the vignette:
You might wonder why we came up with the idea of an βeditionβ, rather than creating a new package like testthat3. We decided against making a new package because the 2nd and 3rd edition share a very large amount of code, so making a new package would have substantially increased the maintenance burden: the majority of bugs wouldβve needed to be fixed in two places.
I think Hadley says it well. The breaking changes in testthat
would ordinarily be disruptive to users because of the thousands of reverse dependencies. However, the impact on the testthat
package itself is minor, and a totally new package would be overkill.
With drake
, similar changes have occurred a small handful of times in the past. In version 6, I restructured the cache in order to improve efficiency. The change was not back-compatible, and I configured drake
to error out at the right time, preserve existing data, and walk users through their available options. Similarly, version 7 removed all the obsolete parallel backends and threw a deprecation warning. These small but disruptive changes were manageable, and they did not require a new package.
This time, however, the change is tectonic. drake
's goals have transformed, and the package is on the brink of outgrowing its own architecture and design principles. More than that, my entire philosophy of programming has shifted, and I think completely differently about what a pipeline tool should do and how it should be designed. A new package seemed like the only way to make meaningful progress, and I am glad I went that direction.
Motivation
drake
is the largest and most fruitful software development undertaking I have ever attempted. Just working through the technical issues more than doubled my proficiency in R, and the success of the tool granted me access to a fountain of community wisdom. To say I learned a lot would be an understatement. I am not the same developer I was four years ago, and I think totally differently about software.
For the purposes of targets
, if I had to identify a single tipping point, it would probably be a book recommendation from Jim Hester: Design Patterns: Elements of Reusable Object-Oriented Software, the classic text by Gamma, Helm, Johnson, and Vlissides. At the time, I already had some experience with traditional message-passing OOP, but I was still confused about the problems it solved, and I was unsure how to apply it properly. The first few chapters finally answered questions about when to use object composition instead of inheritance, the importance of mutable objects for certain use cases, and how to choose a mental model that fits the task at hand.
I began to see the limitations of drake
's design. drake
uses simple immutable data structures such as ad hoc lists, some of them god objects. As a result, it has serious trouble expressing features such as static branching, dynamic branching, and dynamic files. Seamless targets
-style cloud integration was beyond reach, and compatibility between dynamic files and data recovery was nearly impossible and super messy to enforce.
Finally, I realized I would no longer be able to make nontrivial improvements to drake
as long as it relied on the old infrastructure. The path forward needed not only a complete rewrite and a complete reevaluation of every feature, but also an entirely different programming paradigm and philosophy.
A new design
Internally, targets
is a collection of interdependent, carefully constructed, sharply defined, lightweight R6 and S3 classes with mutable instances. I built these 74 classes from the ground up in order to express the complicated reasoning that happens both within and among targets, and the mental model easily captures the behavior of a dynamic pipeline tool. This internal harmonization paved the way for several improvements that were impossible in drake
, some of which appear in the statement of need, notably cloud storage integration, interface extensibility, and parallel-efficient flexible dynamic branching.
Let's take dynamic branching as an example. drake
struggles with this because it does not have a formal structure to express what it means to be a target. targets
, on the other hand, supports a formal inheritance hierarchy of classes to express the unique role of each target in the dynamic branching process. For example, "stems" are unbranched targets capable of producing "buds", and "patterns" are special templates that generate the actual branches. (This design document overviews the different kinds of targets and their role in dynamic branching.) All instances are mutable, and the mental model completely aligns with the realities of implementation. There is rigidness where we need rigidness, and there is flexibility where we need flexibility. This is what allows targets
to overcome the permanent map-reduce efficiency limitation of drake
.
And with that, I think I am all caught up on editor requests for the moment. When the time comes, please let me know what else I can do.
An aside: you can read a lot into the package names here. "drake" is an acronym: "data frames in R for Make". In the first few versions, drake::make(parallelism = "Makefile")
would actually take a data frame of commands (a drake
plan) and turn it into a real Makefile. drake
has long since moved beyond this approach, and the "Makefile" backend is no longer supported. In other words, drake
outgrew its own name.
The name "targets" represents a different way of thinking about the design. Whereas drake
's data structures operate at the level of the entire pipeline, nearly all the structure and all the reasoning of targets
happens on an individual target-by-target basis. In targets
, target objects are first-class citizens, and they express almost all of the behavior through method dispatch and through the nested OOP objects that compose them. This reductionism really fits the situation, and it ensures the mental model aligns with the realities of implementation.
Thanks for thinking of me as a reviewer! I don't have much time to offer, but if I'm useful, I'll be happy to get involved, especially if there's a specific task to do.
Thanks @wlandau for addressing my comments. I'll come back to you likely next week. I would like to read your comments in greater detail, and discuss with other editors what might be the best two reviewers for this submission. But right now I'm on a short vacation with limited access to internet. I'm excited about how interesting this discussion already is. Thanks! π
Thanks @wlandau for your patience. I can now confirm this submission satisfies editor checks. I'll start seeking reviewers (thanks for your suggestoins).
Please ensure the README files of each package in this submission has an rOpenSci review badge -- maybe via --rodev::use_review_badge()
, rodev::use_review_badge(<issue_number>)
. Badge URL is https://badges.ropensci.org/<issue_id>_status.svg. Full link should be:
RE #401 (comment)
Thanks for the heads up. If a potential conflict of interest is unclearly defined in the guidelines, I'll discuss with other editors. For now you should not worry.
Thanks @wlandau for your patience. I can now confirm this submission satisfies editor checks. I'll start seeking reviewers (thanks for your suggestions).
So glad to hear it!
Please ensure the README files of each package in this submission has an rOpenSci review badge -- maybe via --rodev::use_review_badge(), rodev::use_review_badge(<issue_number>).
Now added.
Also, just an FYI: my employer just approved my JOSS manuscript for disclosure, so I added it to the targets
repo (paper.md
and paper.bib
inside inst/
). I will update the original post in this thread.
We're cutting over to {targets}
for our project templates at work very soon, so that would give me an opportunity to review. I'd be happy to be reviewer if requested.
I do have one reservation and that is that I would be 'coming in hot', off the back of daily {drake}
use for the last 18 months, and frequent correspondence with Will. I do wonder if a lot of the value of a review is derived from 'fresh' eyes looking at the project critically? I wouldn't want to deny {targets}
that opportunity.
Edit: You can expect feedback from the cutover either way, @wlandau π
Thanks, @MilesMcBain! Your feedback has helped a lot already, particularly with interactive debugging support, and I look forward to more.
You also suggested this may be a good time to coordinate with RStudio, and which immediately brought more potential reviewers to mind:
And I originally meant to include @HenrikBengtsson and @mschubert.
I'm excited to announce that @limnoliver and @tjmahr will be reviewing this submission. Please notice this submission includes two closely related packages: "targets" and "tarchetypes".
There is a third package, "targets.manual", that provides documentation; your feedback on "targets.manual" is welcome but a full review is not required.
The due date for your reviews is 2020-12-19. This is four weeks from today -- a bit longer than usual considering this double-submission and COVID.
Please let me know if you have any issues, questions or comments. I look forward to working with you all.
Thanks for making this happen, @maurolepore. @limnoliver and @tjmahr, I am looking forward to your feedback.
I have a new package on GitHub called stantargets
, an extension to targets
for Stan-powered Bayesian data analysis. stantargets
makes it super easy to set up useful pipelines without having to write many functions or think about branching. It can access all the major algorithms of cmdstanr
(MCMC, variational Bayes, optimization), and it supports both single-fit workflows and multi-rep simulation studies (vignettes here).
I mention stantargets
here just to give you an overall sense of where I am taking targets
. I intend stantargets
to set a precedent for other methodology-specific workflow frameworks built on top of targets
, e.g. possibly for machine learning with keras
and torch
.
To give you more of a sense of where I am going with this, I just announced the targetopia, an ecosystem of packages to democratize Make-like pipelines in R. I hope to add more packages and cover more kinds of data science in 2021.
Maybe targetopia is slightly outside the scope of this review, but I would at least like to mention this: I envision it to reach more fields of data science than I have the expertise to cover. (I am starting with Bayesian statistics because that is where I am strongest.) So if anyone reading this would like to contribute a targetopia package, that would be amazing. I just added some tips at the bottom of https://wlandau.github.io/targetopia.html, and I would be eager to get in touch.
Can I have an extension on this review? I'm sorry this review period overlaps with some funky holiday scheduling.
Thanks @tjmahr for acknowledging the deadline. It is indeed a difficult time. When would you be ready?
Similarly, I could use an extension! I'll be able to complete a review by Dec. 29.
Thanks @limnoliver. I anticipated this time would be inconvenient. Dec 29 would be okay.
I think academic peer review in general is already generous and altruistic, not to mention daunting. The holidays, COVID, and the volume of this submission compound all that so much. So I am extremely grateful that you signed up, and I will do whatever I can to recognize your efforts. December 29 is small as delays go, not a big deal on my end.
Dear @limnoliver and @tjmahr,
I hope you are having a good end of the year.
I imagine this might be a difficult time to review a package. Do you need extra time or some other help?
Thanks again for your generosity. I hope you don't mind my reminders.
I expect to find time in the first week of the new year. I created an issue and Will resolved my biggest hang up with the package so far.
Sorry for the delay! The review took a big longer than expected, but I really enjoyed doing it. Kudos to you @wlandau, for creating such a useful tool. For context, I'm coming at this review with experience using pipelines in R (using remake
and a wrapper package to handle errors, shared caching, and what you're calling branching), but am almost completely naΓ―ve to drake
.
Package Review
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
- Briefly describe any working relationship you have (had) with the package authors.
- As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).
Documentation
The package includes all the following forms of documentation:
- A statement of need clearly stating problems the software is designed to solve and its target audience in README
- Installation instructions: for the development version of package and any non-standard dependencies in README
- Vignette(s) demonstrating major functionality that runs successfully locally
- Function Documentation: for all exported functions in R help
- Examples for all exported functions in R Help that run successfully locally
- Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with
URL
,BugReports
andMaintainer
(which may be autogenerated viaAuthors@R
).
Comments on documentation:
-
In the
Contributing
document, you reference a three-paragraph format, but don't give an example in sectionIssues
. -
I read a lot of your justification for why documentation is in certain places, and this all made sense to me (and overall I thought your documentation was accessible and thorough enough). However, I got tripped up in a few places, so here are some suggestions based on my experience. I don't think the
Statement of Need
vignette belongs in the package. A lot of this info is duplicated in the user manual, and it feels like a better fit in the manual because there is a lot of narrative relative to code/demonstrating functions withintargets
. Overall, I don't think the three vignettes intargets
help someone use the functions within the package. "The composition of target objects" totally caught me off guard because I had read the entire manual and never saw mention of these target classes. While overall this vignette helps me to understand the package, it didn't help me use it. In the "orchestration and branching model", I found it hard to wrap my head around how or when I would interact with object classes without concrete examples preceding it. I think this is important info, but if I had to pick three vignettes for the package itself, I would maybe stick this in the manual and opt for more of a how-to vignette (or two or three). I think thetargets
package itself would benefit from a "Getting Started" vignette. While this info is elsewhere (through the tutorial or minimal example), I found the vignettes for the package itself to not be a good starting point for a new user like myself. There are hours and hours of documentation across your ecosystem (which is great!), but how can someone get started in under an hour? This should be the role of the vignettes (in my opinion!). -
In the "skipping up-to-date targets" section of the "Data and metadata management" vignette, I think it's supposed to say "targets checks these rules in the order given below"
-
In section 2.2 of "The orchestration and branching model", in the example pipeline, it's unclear how "all analyses use the entirety of data1" because I don't see that target used in any other targets?
-
In
targets-manual
, chapter 7 introduces a pipeline in the opening section, but the code chunk does not contain a pipeline, only βtar_visnetwork()β. I think the pipeline that is supposed to appear is below in section 7.1.2? -
It would be nice to have a section on "cleaning up", and best practices around when to use some of the cleaner functions. I came across old targets that no longer existed when I was experimenting, and I wasn't sure what the best approach was to cleaning up the pipeline.
-
In the manual in section
4.1 Internal Files
, it says the progress file is_targets/meta_progress
, but in my experience this was_targets/meta/progress
. -
Some of the examples in the help files are more helpful than others. I appreciated examples like in
?tar_make
that actually can run a complete example. In cases like ?tar_cancel, because it's not a complete example (andis_monday()
does not exist) we don't get to really see it in action, and it took me slightly longer to think about what how I should use it in my pipeline, and what the consequences of cancelling a target would be (I was pleasantly surprised to see that the pipeline skips it as if it is up-to-date...this is a really handy function!).
Functionality
- Installation: Installation succeeds as documented.
- Functionality: Any functional claims of the software been confirmed.
- Performance: Any performance claims of the software been confirmed.
- Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
- Packaging guidelines: The package conforms to the rOpenSci packaging guidelines
Comments: Packages passed devtools::check()
, devtools::test()
, goodpractice::gp()
. covr::package_coverage()
indicated 96% coverage, and I believe the author explained manual tests in the convo above. For tarchetypes
there were 14 warnings about testthat
:
Warning (test-tar_render_raw.R:44:1): (code run outside of `test_that()`)
The `code` argument to `test_that()` must be a braced expression to get accurate file-line information for failures.
Backtrace:
1. testthat::test_that(...) test-tar_render_raw.R:44:0
Despite the power and complexity of the package, overall I thought the user interface was great. It was easy to look at the list of functions, guess what they did, and get to a point of understanding what they did.
Estimated hours spent reviewing: 10+ (I would call it 10 critically thinking about the package/documentation, but spent much more time teaching myself targets
in anticipation of using it.)
- Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.
testing package functions
I wanted to test the debug functions. I built upon your example, and came across something I wasn't expecting. I often want to expose important choices/options in the function arguments so anyone else using the pipeline doesn't have to dig into the functions to find something. In my example, I expected the value of y to also be included in the workspace environment, but it wasn't:
options(tidyverse.quiet = TRUE)
library(targets)
library(tidyverse)
options(crayon.enabled = FALSE)
tar_option_set(error = "workspace")
f <- function(x, y) {
stopifnot(x < 4)
print(y)
}
tar_pipeline(
tar_target(x, seq_len(4)),
tar_target(y, f(x, 'hello!'), pattern = map(x))
)
# in console
tar_make()
#* run target x
#* run branch y_80b9a5d2
# "hello!"
#* run branch y_54f0a599
# "hello!"
#* run branch y_be82812d
# "hello!"
#* run branch y_dca2bb53
#x error branch y_dca2bb53
#* record workspace y_dca2bb53
tar_workspace(y_dca2bb53)
print(x)
# [1] 4
print(y)
# Error in print(y) : object 'y' not found
Going back to inspect the manual, I figured out why this is happening (targets doesn't recognize y as a dependency, y is not in the global environment). I expected the workspace to function similar to browser(), so this threw me off when it wasn't a fully ready interactive environment. Maybe explicitly include more arguments in your top-level functions in the arguments to demonstrate the limitations there, and solutions/tradeoffs? In a bigger pipeline, it would be cumbersome to declare those as targets themselves, and I think the other option is to set those variables in the _targets.R script. Could you talk about how you handle those limitations? (Now after reviewing tarchetypes
, I also realize the targets can get simpler with tar_plan
.)
To learn the package and test the core functionality, I tried out a common use-case for us. Currently, we use remake and an
internal wrapper package (example here) that provides shared cache, error handling, and task splitting (similar to branching) services on top of remake. The ease of creating this example made me really, really excited about targets!
library(targets)
library(dataRetrieval)
library(MESS)
library(dplyr)
options(tidyverse.quiet = TRUE, crayon.enabled = FALSE)
# use case:
# want to retrieve data from the National Water Information System (NWIS)
# problem for big calls is that service times out or errors if
# too many records returned or too many sites called at once.
# Solution is to figure out what sites are available and how much data
# each site has, and download in batches of sites. For a national pull, this can
# take a day or two, so a good use case for pipelining.
query_nwis <- function(states, parameters) {
sites <- dataRetrieval::whatNWISdata(stateCd = states, parameterCd = parameters, service = 'dv')
return(sites)
}
group_sites <- function(sites, max_results, max_sites) {
out <- sites %>%
filter(stat_cd %in% "00003") %>%
select(site_no, count_nu) %>%
distinct() %>%
arrange(desc(count_nu)) %>%
mutate(task_num = MESS::cumsumbinning(x = count_nu,
threshold = max_results,
maxgroupsize = max_sites)) %>%
group_by(task_num) %>%
targets::tar_group()
}
pull_data <- function(sites, parameters) {
dat <- readNWISdv(siteNumbers = sites$site_no, parameterCd = parameters)
return(dat)
}
tar_pipeline(
tar_target(states, c('RI', 'DE')), # NWIS accepts two-letter state abbreviations
tar_target(parameters, '00010'), # parameter is USGS code for temperature
tar_target(sites, query_nwis(states, parameters), map(states)),
tar_target(site_batches, group_sites(sites, max_results = 50000, max_sites = 50), iteration = 'group'),
tar_target(dat, pull_data(site_batches, parameters), pattern = map(site_batches))
)
One surprise digging into my example above was that tar_path(dat)
returns _targets/objects/dat
, despite that file not existing, and despite the path in the metadata being NA
. Should this return NA
, or the children names, or the filepath with a wildcard for clarity? I do appreciate that you can tar_read(dat)
or tar_load(dat)
with helpers behind the scenes.
A common use-case I have is once we get to doing site-level things (plotting, modeling), I'll notice a problem and want to
see the original data that was pulled from the web service. Or, the combined target is huge, and I just want to look at the (much smaller) branch that containes a particular site. How do I know which branch file site "01115120" is in, for example? Because in the code above, my batches are groups of sites, I want a dataframe of site IDs with corresponding filenames of which branches they belong to. I was expecting that crosswalk information to exist in _targets/meta/meta
. The manual spends a lot of time showing how you can check the pattern created, what order the branches are created, and tar_pattern gets close. I wonder if a function like tar_pattern could accept the name of a target with type == pattern
, and it produces a dataframe that shows the existing filenames and corresponding values of the variables that were mapped over? Admittedly, it wasn't too terrible to come up with what I wanted (see below), so at a minimum maybe include an example somewhere in the manual when discussing patterns and branching for those that want a post hoc crosswalk?
file_names <- filter(tar_meta(), name %in% 'dat') %>%
pull(children) %>% unlist()
crosswalk <- tar_read(site_batches) %>%
left_join(tibble(file_names,
tar_group = seq(length(file_names))))
# check if my assumptions were true!
crosswalk$file_names[1]
# [1] "dat_abef3008"
actual <- tar_read("dat_abef3008") %>%
select(site_no) %>% distinct() %>% pull(site_no)
predicted <- filter(crosswalk, file_names %in% "dat_abef3008") %>%
pull(site_no)
all.equal(sort(actual), sort(predicted))
# [1] TRUE
This is fantastic, Sam! I know the review was an enormous effort, and I cannot thank you enough! I will digest your feedback and respond this week.
Reply to @limnoliver's review
Thank you for your thorough and insightful review. Your feedback is super impactful, and I incorporated everything I could into the packages and documentation.
- In the Contributing document, you reference a three-paragraph format, but don't give an example in section Issues.
I actually ended up changing my mind about that. All mentions of that three-paragraph format should now be gone.
- I don't think the Statement of Need vignette belongs in the package. A lot of this info is duplicated in the user manual, and it feels like a better fit in the manual because there is a lot of narrative relative to code/demonstrating functions within targets.
Now that you mention it, I completely agree. I moved most of the statement of need into a new chapter on drake
: the future direction of drake
, how to transition to targets
, and technical details about the advantages of targets
. Together, I think that chapter and the intro chapter sufficiently motivate the package, and the shift deduplicates a lot of text. Really glad to have done this.
Overall, I don't think the three vignettes in targets help someone use the functions within the package. "The composition of target objects" totally caught me off guard because I had read the entire manual and never saw mention of these target classes. While overall this vignette helps me to understand the package, it didn't help me use it. In the "orchestration and branching model", I found it hard to wrap my head around how or when I would interact with object classes without concrete examples preceding it. I think this is important info, but if I had to pick three vignettes for the package itself, I would maybe stick this in the manual and opt for more of a how-to vignette (or two or three). I think the targets package itself would benefit from a "Getting Started" vignette. While this info is elsewhere (through the tutorial or minimal example), I found the vignettes for the package itself to not be a good starting point for a new user like myself. There are hours and hours of documentation across your ecosystem (which is great!), but how can someone get started in under an hour? This should be the role of the vignettes (in my opinion!).
Sorry, it must have been frustrating to expect friendly user-side documentation in the vignettes and find esoteric developer-side documentation instead. I just moved all those developer vignettes to a new bookdown
website at https://wlandau.github.io/targets-design and added a short new introductory vignette at and added a new section to the README to chart a smooth path through the material for new users.targets/vignettes/start.Rmd
. The vignette is pretty high-level, it describes the order in which to walk through resources and what to expect from each.
Beyond that, I am concerned about adding more vignettes to the targets
package itself. In addition to the potential tech debt I described before, it seems like it might create some ambiguity with the manual. (Which vignettes belong with targets
, which belong with targets-manual
, and how will users know when to use which resource?). In addition, https://github.com/wlandau/targets-tutorial is its own supply of introductory vignettes, and these come with an RStudio Cloud workspace to help users try out the code. Lastly, drake
users told me they sometimes struggled with the volume of reading material (and documentation in general) and I would like to be more conscious of that this time around.
- In the "skipping up-to-date targets" section of the "Data and metadata management" vignette, I think it's supposed to say "targets checks these rules in the order given below"
Should be fixed now in ropensci-books/targets-design@50d4280.
- In section 2.2 of "The orchestration and branching model", in the example pipeline, it's unclear how "all analyses use the entirety of data1" because I don't see that target used in any other targets?
Yup, that's a typo in the pipeline code. I meant to use data1
in the analysis
target. Fixed in ropensci-books/targets-design@52b74b7.
- In targets-manual, chapter 7 introduces a pipeline in the opening section, but the code chunk does not contain a pipeline, only βtar_visnetwork()β. I think the pipeline that is supposed to appear is below in section 7.1.2?
Yes, the pipeline and graph were supposed to appear. Should be fixed now.
- It would be nice to have a section on "cleaning up", and best practices around when to use some of the cleaner functions. I came across old targets that no longer existed when I was experimenting, and I wasn't sure what the best approach was to cleaning up the pipeline.
I just added a new section at the end of the chapter on best practices: ropensci-books/targets@7ae1334. It describes all the cleaning functions except tar_deduplicate()
. (tar_make()
etc. deduplicate the metadata automatically, so most users should not need to invoke it manually.)
- In the manual in section 4.1 Internal Files, it says the progress file is _targets/meta_progress, but in my experience this was _targets/meta/progress.
Definitely a typo, fixed in ropensci-books/targets@f481e60.
- Some of the examples in the help files are more helpful than others. I appreciated examples like in ?tar_make that actually can run a complete example. In cases like ?tar_cancel, because it's not a complete example (and is_monday() does not exist) we don't get to really see it in action, and it took me slightly longer to think about what how I should use it in my pipeline, and what the consequences of cancelling a target would be (I was pleasantly surprised to see that the pipeline skips it as if it is up-to-date...this is a really handy function!).
I went back and wrote runnable examples for the functions that lacked them:
tar_cancel()
tar_envir()
tar_name()
tar_option_get()
tar_option_reset()
tar_option_get()
tar_path()
tar_seed()
tar_target()
tar_target_raw()
Comments: Packages passed devtools::check(), devtools::test(), goodpractice::gp(). covr::package_coverage() indicated 96% coverage, and I believe the author explained manual tests in the convo above.
That's odd. I meant to write explicit #nocov
comments in all the places I gave up on automated coverage, and the online coverage reports show 100%. What do you see when you run devtools::test()
? Are any tests skipped because Suggests:
packages are missing?
For tarchetypes there were 14 warnings about testthat:
Yes, I reproduced that with testthat
version 3.0.1. It was because I was writing tests that look like test_that("", suppressMessages(...))
. I recently updated tarchetypes
to begin the test code with braces, and the warnings should be gone as of ropensci/tarchetypes@a461e1f.
Estimated hours spent reviewing: 10+ (I would call it 10 critically thinking about the package/documentation, but spent much more time teaching myself targets in anticipation of using it.)
Knowing what you know now, should I have opened a different submission for each of targets
, tarchetypes
, and the manual? Do tarchetypes
and the manual each make sense to review in isolation? How much time would that have saved you? The editors tried to warn me about the workload I would be inflicting on reviewers, and if I judged poorly, maybe this could help future submissions.
I wanted to test the debug functions. I built upon your example, and came across something I wasn't expecting. I often want to expose important choices/options in the function arguments so anyone else using the pipeline doesn't have to dig into the functions to find something. In my example, I expected the value of y to also be included in the workspace environment, but it wasn't: ... Going back to inspect the manual, I figured out why this is happening (targets doesn't recognize y as a dependency, y is not in the global environment). I expected the workspace to function similar to browser(), so this threw me off when it wasn't a fully ready interactive environment.
I updated the debugging section to try to clear up this ambiguity.
Maybe explicitly include more arguments in your top-level functions in the arguments to demonstrate the limitations there, and solutions/tradeoffs?
Yeah, I now see that it might have been confusing to have target named x
and function f()
whose only argument was also named x
. Now, f()
has multiple arguments, none of which is named x
.
In a bigger pipeline, it would be cumbersome to declare those as targets themselves, and I think the other option is to set those variables in the _targets.R script. Could you talk about how you handle those limitations?
Yes, that's a big advantage of the built-in interactive debugging support in targets
, as opposed to tar_workspace()
. I added some more explanation in ropensci-books/targets@9f97ac7 to try to clear things up.
To learn the package and test the core functionality, I tried out a common use-case for us. Currently, we use remake and an
internal wrapper package (example here) that provides shared cache, error handling, and task splitting (similar to branching) services on top of remake.
That's a really neat use case and a great example of tar_group()
in action.
One surprise digging into my example above was that tar_path(dat) returns _targets/objects/dat, despite that file not existing, and despite the path in the metadata being NA.
Looking back at the documentation, I realize now that the help file was not totally clear. My intention was to give tar_path()
a much simpler niche: just identify the path without worrying if it exists. This gives advanced users some building blocks for custom cues, as in ropensci/targets#131. Hopefully ropensci/targets@2d72e3c clears that up.
Should this return NA, or the children names, or the filepath with a wildcard for clarity?
I believe tar_meta()
covers this.
tar_meta(dat, path)$path[[1]]
#> [1] NA
tar_meta(dat, children)$children[[1]]
#> [1] "dat_ab76cb22" "dat_f2821206" "dat_b8dadf92"
A common use-case I have is once we get to doing site-level things (plotting, modeling), I'll notice a problem and want to
see the original data that was pulled from the web service. Or, the combined target is huge, and I just want to look at the (much smaller) branch that containes a particular site. How do I know which branch file site "01115120" is in, for example? Because in the code above, my batches are groups of sites, I want a dataframe of site IDs with corresponding filenames of which branches they belong to. I was expecting that crosswalk information to exist in _targets/meta/meta. The manual spends a lot of time showing how you can check the pattern created, what order the branches are created, and tar_pattern gets close. I wonder if a function like tar_pattern could accept the name of a target with type == pattern, and it produces a dataframe that shows the existing filenames and corresponding values of the variables that were mapped over? Admittedly, it wasn't too terrible to come up with what I wanted (see below), so at a minimum maybe include an example somewhere in the manual when discussing patterns and branching for those that want a post hoc crosswalk?
That's a tough one because the metadata file does not actually store dependency relationships (to keep storage light) and there could be a mismatch between the old _targets/meta/meta
and current _targets.R
. So I implemented a new tar_branches()
function that accepts a pattern definition and returns a data frame of branch names and their dependencies, all constructed from the metadata file alone.
tar_branches(dat, map(site_batches))
#> # A tibble: 3 x 2
#> dat site_batches
#> <chr> <chr>
#> 1 dat_ab76cb22 site_batches_d249d09b
#> 2 dat_f2821206 site_batches_eaa4bb57
#> 3 dat_b8dadf92 site_batches_3a2d01e7
It's an improvement, but it's not perfect. If site_batches
were a pattern, you could just call readd(site_batches_eaa4bb57)
to check the site of branch dat_f2821206
. But site_batches
is a stem, so site_batches_eaa4bb57
is a bud, not a branch. In other words, there is no actual value of site_batches_eaa4bb57
in storage. (And I do not think it would be wise to reconstruct the actual data in each bud because it could get quite large.) We have a similar problem when we match sites
and states
.
tar_branches(sites, map(states))
#> # A tibble: 2 x 2
#> sites states
#> <chr> <chr>
#> 1 sites_83036b8f states_ce340191
#> 2 sites_2b7c2325 states_14e6a3f8
So I think some use cases will require the user to proactively keep track of branch names (with tar_name()
) and site names, and then pass that information forward to downstream branches. Fortunately, you are already working with data frames as much as possible, and I think it would be straightforward to manually append new columns. I added a new section to the manual to discuss this issue: https://wlandau.github.io/targets-manual/dynamic.html#branch-provenance.
Update: after thinking more about your experience with the documentation, I decided to add a new "How to get started section at the top of the README. Minutes 6-40 of the video, along with the walkthrough chapter of the manual, should help most people get started in under an hour. For continued learning, the first 3 chapters of the short course have friendly hands-on practice, and then the example projects should get folks ready to start serious projects of their own. I also rearranged the sections of the README to convey a more logical progression through the material.
I also decided to delete targets/vignettes/start.Rmd
. I feel that potential future written documentation would be better off in other places: the README and reference website home page are more obvious and discoverable for small sections of prose, and the two bookdown
books are more natural homes for full articles. It's not that I disagree with vignettes in general. In fact, jagstargets
and stantargets
each have two, and they fit quite well and serve crucial roles. But for a package system as large and sprawling as targets
, it seems like the vignette system has both poor discoverability and too little room to scale, and I find it hard to bring myself to use it for this.
@wlandau, I really like your new section https://wlandau.github.io/targets/#how-to-get-started. I did steps 1. and 2. and they did help me understand targets in about 1h. While I did that I took some notes. This is not strictly a review but I figured you might find some of this useful:
-
It surprised me that the README file of targets did not show any example.
-
The chapter Walkthrough seems out of sync with the source code. It seems that the book chapter hard-codes content rather than reference it from the source code. It seems that the source code changed and the manual became outdated. In general, code chunks of the manual that use
eval = FALSE
seem oudtated, including the directory tree of the targets-minimal repo, and the use oflist()
instead oftar_pipeline()
in the file_targets.R
(which causestargets::tar_visnetwork()
to throwError : not a tar_pipeline() object. _targets.R must end with one.
). -
In the book I didn't find an "edit this file" icon. Is this intentional? (I use it a lot when I read bookdown books and want to help fix a typo).
-
tar_manifest()
andtar_meta()
return a tibble but I see them wrapped inas.data.frame()
. Is that because in that particular case you prefer the print method of vanilla dataframe? -
targets::tar_visnetwork()
warns:
Warning message:
Strategy 'multiprocess' is deprecated in future (>= 1.20.0). Instead, explicitly specify either 'multisession' or 'multicore'. In the current R session, 'multiprocess' equals 'multicore'.
Thanks for the feedback, @maurolepore.
@wlandau, I really like your new section https://wlandau.github.io/targets/#how-to-get-started. I did steps 1. and 2. and they did help me understand targets in about 1h.
Glad to hear that helped.
It surprised me that the README file of targets did not show any example.
Yes, I realize this unconventional, but I find it hard to find room because there are so many external resources that need to be visible. I feel the README is the most discoverable piece of documentation, so I am trying to keep it concise and avoid drake
's catch-all approach that buried important links under a lot of volume.
The chapter Walkthrough seems out of sync with the source code. It seems that the book chapter hard-codes content rather than reference it from the source code. It seems that the source code changed and the manual became outdated. In general, code chunks of the manual that use eval = FALSE seem oudtated, including the directory tree of the targets-minimal repo,
I wanted to include more in https://github.com/wlandau/targets-minimal than I do at https://wlandau.github.io/targets-manual/walkthrough.html: more realistic workflow files, HPC, etc. So I just updated the manual to mention that the latter is an really abridged version of the former.
and the use of list() instead of tar_pipeline() in the file _targets.R (which causes targets::tar_visnetwork() to throw Error : not a tar_pipeline() object. _targets.R must end with one.).
ropensci/targets#254 is a recent update that deprecates tar_pipeline()
in favor of list()
. I feel this is a much friendlier interface.
In the book I didn't find an "edit this file" icon. Is this intentional? (I use it a lot when I read bookdown books and want to help fix a typo).
I just forgot, thanks for the reminder. Just pushed an _output.yml
file which should take care of that.
tar_manifest() and tar_meta() return a tibble but I see them wrapped in as.data.frame(). Is that because in that particular case you prefer the print method of vanilla dataframe?
For tar_manifest()
and tar_progress()
, there is no real reason, and I just pushed a fix. For tar_meta()
I used a plain data frame so users could see more columns.
targets::tar_visnetwork() warns:
Yes, it is better to keep future::plan()
commented out and to use the multisession plan instead of multiprocess. Should be fixed now in https://github.com/wlandau/targets-minimal and the cloud workspace.
I wanted to include more in https://github.com/wlandau/targets-minimal than I do at https://wlandau.github.io/targets-manual/walkthrough.html: more realistic workflow files, HPC, etc. So I just updated the manual to mention that the latter is an really abridged version of the former.
Just changed my mind on that. I simplified https://github.com/wlandau/targets-minimal so it is easier to digest and agrees with the manual.
Package Review
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
- Briefly describe any working relationship you have (had) with the package authors.
- As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).
I met the package author at an R-Unconference meetup in Chicago in 2019 (?). At the event, Will taught me how to use the drake package and I closed an issue for the package. Overall, I am a two-time contributor to drake.
Documentation
The package includes all the following forms of documentation:
- A statement of need clearly stating problems the software is designed to solve and its target audience in README
- Installation instructions: for the development version of package and any non-standard dependencies in README
- Vignette(s) demonstrating major functionality that runs successfully locally
- No vignettes, but a full manual instead.
- Function Documentation: for all exported functions in R help
- Examples for all exported functions in R Help that run successfully locally
- Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with
URL
,BugReports
andMaintainer
(which may be autogenerated viaAuthors@R
).
Functionality
- Installation: Installation succeeds as documented.
- Functionality: Any functional claims of the software been confirmed.
- Performance: Any performance claims of the software been confirmed.
- Automated tests: Unit tests cover essential functions of the package
and a reasonable range of inputs and conditions. All tests pass on the local machine. - Packaging guidelines: The package conforms to the rOpenSci packaging guidelines
targets: All tests pass or skip. Package check succeeds. All examples run.
tarchetypes: All tests pass. Package check succeeds. All examples run. There is a bad argument in tar_map()
's call to tar_manifest()
.
The documentation has this trick where it skips examples unless a system environment variable is set, instead of using \dontrun
or \donttest
directives. Would it be better to use these conventions?
I was able (I think) to run all the examples with the following code. It confirms that all help topics for user-facing functions have examples.
library(magrittr)
library(purrr)
library(targets)
library(tarchetypes)
Sys.setenv(TARCHETYPES_LONG_EXAMPLES = "true")
shhh_example <- purrr::quietly(example)
d <- hsearch_db("targets")
results <- d$Base$Topic %>%
lapply(
shhh_example,
character.only = TRUE,
run.dontrun = TRUE
)
results %>%
map("warnings") %>%
flatten_chr()
#> [1] "βas_pipelineβ has a help file but no examples"
#> [2] "βreexportsβ has a help file but no examples"
#> [3] "βrstudio_addin_tar_glimpseβ has a help file but no examples"
#> [4] "βrstudio_addin_tar_loadβ has a help file but no examples"
#> [5] "βrstudio_addin_tar_make_bgβ has a help file but no examples"
#> [6] "βrstudio_addin_tar_outdatedβ has a help file but no examples"
#> [7] "βrstudio_addin_tar_progressβ has a help file but no examples"
#> [8] "βrstudio_addin_tar_readβ has a help file but no examples"
#> [9] "βrstudio_addin_tar_targetβ has a help file but no examples"
#>[10] "βrstudio_addin_tar_visnetworkβ has a help file but no examples"
#>[11] "βtar_watch_app_uiβ has a help file but no examples"
#>[12] "βtarget_run_workerβ has a help file but no examples"
#>[13] "βtargets-packageβ has a help file but no examples"
d2 <- hsearch_db("tarchetypes")
results2 <- d2$Base$Topic %>%
lapply(
shhh_example,
character.only = TRUE,
run.dontrun = TRUE
)
results2 %>%
map("warnings") %>%
flatten_chr()
#> [1] "βreexportsβ has a help file but no examples"
#> [2] "βtar_force_changeβ has a help file but no examples"
#> [3] "βtar_knit_runβ has a help file but no examples"
#> [4] "βtar_render_rep_runβ has a help file but no examples"
#> [5] "βtar_render_rep_run_paramsβ has a help file but no examples"
#> [6] "βtar_render_runβ has a help file but no examples"
#> [7] "βtar_rep_runβ has a help file but no examples"
#> [8] "βtarchetypes-packageβ has a help file but no examples"
Estimated hours spent reviewing: 24
- Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.
Review Comments
First, a bit about the scope of my review. I am not able to confirm all of the functionality of this package, as I lack the experience and resources to test some of the high-performance or cloud-based features of targets. I used the package as local single-machine user. I also did not closely review the source code in the package as it is immense. My review, beyond the checkboxes above, focuses on the manual and using targets in two test cases.
Notes on the manual
Whenever I use drake, I find myself with a browser window with tabs on the manual pacakges. My test-drive of targets was no different. Therefore, I closely read over the manual. Here are notes on each chapter.
The first two chapters of the manual are clear and brief. Chapter 1 introduce build pipelines and targets. Chapter 2 walks through a minimal example.
Chapter 3 describes best practices. Here the focus abruptly shifts from describing a style of programming (away from imperative, using functions instead of scripts) to describing how to use targets. I think there are two sets of best practices. One is a change in mindset and programming style (from a sequence of files that run R code to a sequence of function calls that run data through a pipeline), and the other is a set of tips for using targets. I recall that drake had a bunch of documentation (and even functions) for how to make this conceptual shift, but some more guidance here for users might be helpful. Currently, the "clean, organized, modular, function-oriented style of programming" is mentioned but not much guidance is given about it. Just a basic sketch of how
01-data.R
02-preprocess.R
03-analysis.R
04-summaries.R
might be realized as
raw_data <- get_data(...)
data <- preprocess_data(raw_data, ...)
model <- fit_model(data, ...)
summary <- summarize_model(model, ...)
would be helpful.
The debugging section is really helpful and informative and I had missed these originally. It might make sense to promote debugging features to their own chapter.
Because Chapter 3 provides miscellaneous guidance about using targets, there should be a section on how to interface with other tools out of R. For example, how might I run a nodejs script to spellcheck or lint my markdown files produced by RMarkdown whenever they change? How might I perform some basic python task on a csv produced within the pipeline?
Chapter 4 talks about files. This section is well done. One thing I would impress upon users is that they might think of "data.csv" as a file. But it's just a string of characters. We have to use tar_target(data, "data.csv", format = "file")
to tell targets that this string names a file that needs to be tracked.
Chapters 5 and Chapters 6 cover dynamic and static branching. I go back and forth on what I would like to see here. I like that each chapter starts with a summary of when to use each kind of branching, but the basic problem that branching is meant to solve--creating several related targets at once--is not fleshed out in either spot. This is just an observation and not a request.
In both my test projects, I went with static branching because I wanted the created targets to have meaningful names. Am I missing something about how to get good names with dynamic branches? The examples all seem to use trivial examples instead of fitting multiple models or loading different datasets.
Chapter 7 covers high performance computing. I am able to use tar_make_future()
successfully. Does the default worker
value (1L
) mean that tar_make_future()
by default is not parallel? In one of my projects, I had been using furrr::future_map()
inside of a function to sweep through a large dataframe. Is there any guidance about having two layers of parallelism?
Chapter 8 covers cloud features. The details for Amazon S3 look detailed enough.
And additional thought, just about teaching or describing this package: Should targets should be explicitly described as a framework or language for describing R pipelines? Just as a ggplot2 is its own grammar of graphics and R's formula syntax describes linear models, targets provides a set of functions that provide a language for describing a build pipeline. For example, you eventually to learn about upstream and downstream targets, dynamic files, and branching.
Kicking the tires (1)
For my review, I decided to convert my website to use targets. Each post is an Rmd file and they are knitted if the .Rmd file is newer than the .md file. I have been only using timestamps but I would like to try something smarter and more portable. The build pipeline in principle is easy to describe:
- Knit a new md whenever the rmd file changes
- Spellcheck the rmd files whenever the rmd file changes
[The following comments predate the first review and Will's fixes of the issue.] Reading the package overview page, I do not see any indication for where to get started quickly. I see the manual linked as a resource for a deep dive. I see links to examples and talks. The vignettes only cover software design issues. Chapter 2 provides a walkthrough for an example analysis; here I figure out that I can use tar_script()
to create a usethis-like placeholder file. This placeholder is well commented with sections and instructions. The only thing it's missing is that the user just needs to run tar_make()
to run the build pipline. I think a Getting Started vignette would be useful, especially because getting started is pretty easy, as it turns out:
- Run
tar_script()
to create a_targets.R
file. - Edit that file. You can even use
tar_edit()
to open that file. - Run
tar_make()
.
[Okay, based on other comments, Will thinks that vignettes are not the best way to document targets. I do think that it worthwhile to highlight how easy it is to add targets to a project using the three steps above.]
I knew from drake that I would need branching to work on my files and I know that files are something drake can struggle with. Dynamic branching over files provides the steps for registering dynamic input files. The manual notes that
Dynamic branching over files is tricky. A target with
format = "file"
treats the entire set of files as an irreducible bundle. That means in order to branch over files downstream, each file must already have its own branch.
tarchetypes provides some sugar for registering file inputs with tarchetypes::tar_files_input()
. But this way imposes dynamic branching where objects come in batches with uninformative names. My list of posts, for example, shows up as:
* run branch posts_31d3b445
* run branch posts_669f285f
* run branch posts_b4b851e3
* run branch posts_ae29f28f
To get better names, I switched to static branching and now find targets with meaningful names:
v skip target in_current_.._R.2016.08.04.fixing.apa.citations.from.pandoc.Rmd
v skip target in_current_.._R.2016.03.23.setnames.dots.Rmd
v skip target in_current_.._R.2016.08.15.recent.adventures.with.lazyeval.Rmd
[...]
Drake provided a way to customize names for branched-over objects but targets lacks this feature.
With a knitting workflow in place, I decided to add automatic spellchecking to the pipeline. This is an easy addition, and tar_force()
allows me to print the spell check results only when there is a spelling mistake. π
tar_force(
spellcheck_current_results,
command = print(spellcheck_current),
force = nrow(spellcheck_current) > 0
)
Kicking the tires (2)
I also tested targets by converting an existing analysis project from drake to targets. This involved reading from a database, cleaning the data, fitting models, doing a parametric bootstrap on the model coefficients and estimating bootstrap intervals on quantiles. It was easy to mechanically convert drake statements of the form target = command
to tar_target(target, command)
. The style of drake plans is more readable and interactive (you can highlight part of a plan and run it to debug a target). tarchetypes provides a tar_plan()
for recreating that style but this review I stuck with the vanilla style.
Working on regular lists (instead of some tar_list()
or some specialized function) made it easy for me to break up my build pipeline with a series of target lists. First, targets_data = list(...)
, then targets_models = list(...)
, then ending the file with list(targets_data, target_models)
.
Again, I preferred static branching because I had a small number of models and wanted predictable names. The porting was straightforward, although a couple of times I pasted a map()
from a drake target into a tar_target(..., pattern)
triggering dynamic branching instead of dynamic branching.
Some themes from testing
- I expect that files will be a hurdle for learners because of the need to target-ify names to track and target-file-ify files named by both names.
- Static branching is better at creating a small number of targets or targets where we are interested the target names, but dynamic branching is introduced first. I expect users to experiment with dynamic branching but run into issues with the names.
- I expect drake issues to trip over themselves by thinking that they know how this package works π .
Comments as a drake user
I first tried targets in summer 2020 to overcome a strange issue in drake. On Windows, filenames have a maximum length. drake's data-store produces files with very long names that would sometimes not work on Windows. I did not want to mess with the data-store or hashing configuration, so I tried targets and the problem was solved. This change in data storage is documented in the package's statement of need under "lighter, friendlier data management". In this case, switching to targets provided me with an immediate win.
- Switching to stricter conventions, like assuming a
_targets.R
file is a huge win for getting started and for studying other target-based projects. tar_target(name, path)
andtar_target(file, name, format = "file")
is better than having implicit file targets viafile_in()
orfile_out()
. It never sat well with having to dofile_in(!! variable)
for file names stored in variables.
Reply to @tjmahr's review
Thank you so much for your review, TJ. You raised some really interesting points, and I enjoyed responding.
There is a bad argument in tar_map()'s call to tar_manifest().
Fixed in ropensci/tarchetypes@01660c1.
The documentation has this trick where it skips examples unless a system environment variable is set, instead of using \dontrun or \donttest directives. Would it be better to use these conventions?
I definitely would prefer \dontrun{}
and \donttest{}
, but CRAN now routinely rejects packages that frequently use these directives. I just went through this with rfacts
late last year. CRAN rejected \dontrun()
but accepted this workaround instead.
Estimated hours spent reviewing: 24
Knowing what you know now, would it have made sense to split up this monolithic submission into 3 different ones? As I mentioned to @limnoliver, although I still view the 3 (now 4) repositories as one unified product, I am still concerned I judged poorly when I plowed ahead with everything at once.
Chapter 3 describes best practices. Here the focus abruptly shifts from describing a style of programming (away from imperative, using functions instead of scripts) to describing how to use targets. I think there are two sets of best practices. One is a change in mindset and programming style (from a sequence of files that run R code to a sequence of function calls that run data through a pipeline), and the other is a set of tips for using targets. I recall that drake had a bunch of documentation (and even functions) for how to make this conceptual shift, but some more guidance here for users might be helpful. Currently, the "clean, organized, modular, function-oriented style of programming" is mentioned but not much guidance is given about it.
In ropensci-books/targets@986f71d, I add a new chapter on functions (based the existing walkthrough chapter and drake
's functions chapter).
The debugging section is really helpful and informative and I had missed these originally. It might make sense to promote debugging features to their own chapter.
Thanks. In ropensci-books/targets@986f71d, I moved the debugging section to its own chapter.
Because Chapter 3 provides miscellaneous guidance about using targets, there should be a section on how to interface with other tools out of R. For example, how might I run a nodejs script to spellcheck or lint my markdown files produced by RMarkdown whenever they change? How might I perform some basic python task on a csv produced within the pipeline?
Now discussed in ropensci-books/targets@6f5c3af.
One thing I would impress upon users is that they might think of "data.csv" as a file. But it's just a string of characters. We have to use tar_target(data, "data.csv", format = "file") to tell targets that this string names a file that needs to be tracked.
I tried to make this clearer in ropensci-books/targets@92200ce.
Chapters 5 and Chapters 6 cover dynamic and static branching. I go back and forth on what I would like to see here. I like that each chapter starts with a summary of when to use each kind of branching, but the basic problem that branching is meant to solve--creating several related targets at once--is not fleshed out in either spot. This is just an observation and not a request.
I took that for granted. As of ropensci-books/targets@b765470, the dynamic branching chapter starts off by motivating the idea of branching in general..
In both my test projects, I went with static branching because I wanted the created targets to have meaningful names. Am I missing something about how to get good names with dynamic branches?
Sounds like static branching fits your use cases much better, and that's totally fine. As I explain in ropensci-books/targets@b765470, different kinds of branching fit different preferences.
There are many real-world scenarios where dynamic branching wins out: for example, the USGS use case @limnoliver described in her review using tar_group()
to branch over a dataset. In that example, we don't even know what branches we are going to need (or even how many) until tar_make()
reaches that point in the pipeline.
Personally, I use dynamic branching in the life sciences for simulating thousands of reps of clinical trials, as I describe in the second half of my R/Pharma 2020 talk (slides, source). And I routinely use dynamic branching for simulation-based calibration to validate Bayesian models, which motivated examples here and here.
As you saw, both kinds of branching are difficult to learn. One of my major goals with the R Targetopia is to completely abstract it away so users do not need to think about it for the most common domains of data science. For example, jagstargets
and stantargets
use both forms of branching, but only behind the scenes.
Now for names. With dynamic branching, it is just not feasible to generate friendly names. A common first assumption is that if target x
branches over target y
and y
has the value c("a", "b", "c")
, then the branch names of x
should be x_a
, x_b
, and x_c
. But what if y
is a grouped data frame? Or a list of Keras models? There is no fully automated way to get unique reproducible friendly informative names from any kind of arbitrary object. The remaining choices are:
- Use a simple indexing scheme like 1, 2, 3, ..., or
- Construct the suffixes from the hashes of the input values.
(1) does not work for targets
because it invalidates too may branches. To demonstrate, let's say x
has branches x_1
, x_2
, and x_3
corresponding to inputs in y <- c("a", "b", "c")
. But what if the user changes the input vector to y <- c("a", "c", "b")
? That would require us to switch x_2
and x_3
and thus invalidate both branches. We have similar situations with insertion and deletion, and it is just not worth trying to manually fight all that.
(2) is the only option I see because hashes are reproducible and the order does not matter. Unfortunately, hashes are unappealing to look at, but all I can do is choose a short hash function (xxhash32).
The examples all seem to use trivial examples instead of fitting multiple models or loading different datasets.
I often find it difficult to cover features this complicated while expressing them through a sufficiently motivating yet pedagogically simple example. Hopefully example projects like https://github.com/wlandau/targets-stan make up for some of this.
Chapter 7 covers high performance computing. I am able to use tar_make_future() successfully. Does the default worker value (1L) mean that tar_make_future() by default is not parallel?
tar_make_future(workers = 1L)
spawns one remote worker and farms everything out to it. So yes, it's not parallel, but it's not the same as running locally either. But I still believe a default of workers = 1L
is good practice because it consumes as few resources as possible, and I do not want to make assumptions about what other users' computers can handle. And believe it or not, tar_make_future(workers = 1L)
is sometimes exactly what you want. Last summer, I used it to submit one really slow Bayesian model to a cluster because I knew the compute node would be a lot more powerful than the login node.
Is there any guidance about having two layers of parallelism?
I mostly defer to packages future
and clustermq
on that one. Users of targets
need to be reasonably familiar with each because the configuration goes in _targets.R
(e.g. future::plan()
). The future
README describes nested futures, and the same principles apply to targets
. In the two-level nesting case, the outer future controls parallelism among targets, and the inner future controls parallelism within targets.
In traditional HPC settings, nested parallelism is most useful when you farm out different targets to different nodes on a cluster and each target uses multiple local cores on its node. The current HPC chapter walks through how to do that for both tar_make_clustermq()
and tar_make_future()
.
And additional thought, just about teaching or describing this package: Should targets should be explicitly described as a framework or language for describing R pipelines? Just as a ggplot2 is its own grammar of graphics and R's formula syntax describes linear models, targets provides a set of functions that provide a language for describing a build pipeline. For example, you eventually to learn about upstream and downstream targets, dynamic files, and branching.
I think of targets
as more of a framework than a language, and I am on the fence about whether most users actually need to know that. The framework perspective is most relevant for package developers who write target factories in R Targetopia packages like jagstargets
and stantargets
. The reason why targets
succeeds as a framework is precisely because it is not a language: i.e., it does not force users into a DSL for pipeline construction. drake
is the opposite: drake_plan()
is a DSL with maximum user convenience and minimum developer convenience. (It is really hard to build drake
plan factories.) That is why we have no drake
-topia.
tar_script() to create a usethis-like placeholder file. This placeholder is well commented with sections and instructions. The only thing it's missing is that the user just needs to run tar_make() to run the build pipline.
I recently added comments to that placeholder file to encourage users to try out tar_make() and tar_read().
I think a Getting Started vignette would be useful...[Okay, based on other comments, Will thinks that vignettes are not the best way to document targets. I do think that it worthwhile to highlight how easy it is to add targets to a project using the three steps above.]
I am trying to use the walkthorugh chapter as a "get started" vignette, and it is now linked from the "how to get started" section of the README. I also added these comments about tar_script() and tar_edit() to that chapter:
Whereas files raw_data.csv and functions.R are typical user-defined components of a project-oriented workflow, _targets.R file is special. Every targets workflow needs a file called _targets.R in the projectβs root directory. Functions tar_script() and tar_edit() can help you create one.
But to really get a handle on _targets.R
, I think people will need hands-on practice. That's why I linked to the targets-tutorial short course as part of the "how to get started" section of the README. The short course has lots of practice using tar_edit() and modifying _targets.R
. And I tried to make the course as accessible as possible with this RStudio Cloud workspace.
I knew from drake that I would need branching to work on my files and I know that files are something drake can struggle with. Dynamic branching over files provides the steps for registering dynamic input files. The manual notes that
For files, static branching is much easier, and it is free of all the perils identified in that section of the manual. For your use cases, I would stick with static-only branching.
Drake provided a way to customize names for branched-over objects but targets lacks this feature.
tar_map()
actually does have a names
argument that lets you select elements from values
to construct the names. If you want custom names, simply add another element to values
and then identify it in names
. Example:
library(targets)
tar_script({
library(tarchetypes)
tar_option_set(packages = "rmarkdown")
tar_map(
values = list(
slug1 = c("file", "blog"),
slug2 = c("new", "old"),
file = c("fixing.apa.citations.from.pandoc.Rmd", "recent.adventures.with.lazyeval.Rmd")
),
names = starts_with("slug"), # tidyselect is optional
tar_target(out, render(file))
)
})
tar_manifest()
#> # A tibble: 2 x 3
#> name command pattern
#> <chr> <chr> <chr>
#> 1 out_blog_old "render(\"recent.adventures.with.lazyeval.Rmd\")" <NA>
#> 2 out_file_new "render(\"fixing.apa.citations.from.pandoc.Rmd\")" <NA>
Created on 2021-01-07 by the reprex package (v0.3.0)
tar_eval()
may be an easier way to do this. It has fewer guardrails but more flexibility.
library(targets)
tar_script({
library(tarchetypes)
tar_option_set(packages = "rmarkdown")
tar_eval(
values = list(
name = c("file_new", "blog_old"),
file = c("fixing.apa.citations.from.pandoc.Rmd", "recent.adventures.with.lazyeval.Rmd")
),
tar_target(name, render(file))
)
})
tar_manifest()
#> # A tibble: 2 x 3
#> name command pattern
#> <chr> <chr> <chr>
#> 1 blog_old "render(\"recent.adventures.with.lazyeval.Rmd\")" <NA>
#> 2 file_new "render(\"fixing.apa.citations.from.pandoc.Rmd\")" <NA>
Created on 2021-01-07 by the reprex package (v0.3.0)
With a knitting workflow in place, I decided to add automatic spellchecking to the pipeline. This is an easy addition, and tar_force() allows me to print the spell check results only when there is a spelling mistake.
Glad to hear that was useful. triggers
in drake
were a struggle because I implemented them at a low level. This time around, tarchetypes
can fortunately handle all that at the interface level (e.g. target factories like tar_force()
).
Again, I preferred static branching because I had a small number of models and wanted predictable names. The porting was straightforward, although a couple of times I pasted a map() from a drake target into a tar_target(..., pattern) triggering dynamic branching instead of dynamic branching.
Some people have proposed an automatic drake-to-targets converter, which I do not think is possible because of the advanced features. (We went through this same thing during the shift from remake to drake, but the converter never got developed.) I try to make up for this in the drake chapter of the manual.
I expect that files will be a hurdle for learners because of the need to target-ify names to track and target-file-ify files named by both names.
As I mentioned above, there is an extra step for names in tar_map()
.
Static branching is better at creating a small number of targets or targets where we are interested the target names, but dynamic branching is introduced first. I expect users to experiment with dynamic branching but run into issues with the names.
Conceptually, static branching is an outer layer on top of dynamic branching, and the former operates on the latter during dynamic-within-static. That is why I put the dynamic branching chapter first. dynamic-within-static is really a static branching technique, but it assumes knowledge of dynamic branching.
The top of the dynamic branching chapter now states that with the exception of dynamic-within-static, those chapters can be read in any order.
And I believe that completes my first round of replies. Please let me know if I missed anything.
And I believe that completes my first round of replies. Please let me know if I missed anything.
-- @wlandau
Upon changes being made, change the review status tag to 5/awaiting-reviewer-response, and request that reviewers indicate approval with the reviewer approval template.
-- Section 8.1.3 of https://devguide.ropensci.org/editorguide.html
--
@limnoliver and @tjmahr,
Thanks a lot for your amazing reviews. I understand @wlandau feels the changes he made address all reviewer's comments. Still I see one reviewer box unchecked: "Vignette(s) demonstrating major functionality that runs successfully locally".
@wlandau explained why he excluded vignettes. But such exclusion rose the reviewer's attention and fails to meet rOpenSci's guidelines. As a fresh editor, I feel uncomfortable accepting the exclusion just now and I propose:
- @limnoliver and @tjmahr, please (1) share your final opinion about including/excluding vignettes, and (2) indicate if -- following the latest changes -- you (2.a) request more changes or (2.b) approve.
- @wlandau please let me know if you prefer to (1) meet this guideline or (2) engage the chief editor. I'm totally happy either way.
Looks like vignettes are a more strict requirement than I originally assumed. I still struggle to use the vignette system for this, but I gave it another try today. The new overview vignette has a summary of targets
as a whole and a short synopsis of each chapter in the user manual. The goal is to condense down that lengthy book and give people a high-level view of all the major features. This way, we have a vignette with a purpose, but we do not introduce new material that should be in the manual or the README.
Thanks @wlandau. I'm reaching out to check how strict the guideline is. I value critical thinking over box-ticking but I need to pick more brains. Thanks for your patience.
@wlandau, following a discussion with other editors I am now think your new vignette "Overview" does meet the guidelines. I believe the guideline's goal is to ensure the package has enough documentation and your package does (via the manual). Yet this vignette "Overview" is useful because of what you wrote above and because some users may skip the website; the vignette can then point users to the website, ensuring they don't miss the great documentation of the package.
I am now waiting for @limnoliver and @tjmahr to say if they want to either (a) request more changes or (b) approve.
I approve.
Before I forget: @limnoliver, @tjmahr, and @maurolepore, would you like to be listed as "reviewers" in the DESCRIPTION
files of the repos under review? If so, would you like me to include your ORCIDs?
It is amazing how much better the documentation has become as a result of your input. I am more optimistic now that new users will find targets
accessible. Even though I feel like I already went down this road with drake
, there is always much more to learn.
I approve as well. I like the additions of the "Overview" vignette and the "Getting Started" section of the README. Overall, I think the new users of targets will have a much easier time navigating these resources now and getting jump started.
Fantastic, thank you so much @limnoliver and @tjmahr!
Approved! Thanks @wlandau for submitting and @limnoliver and @tjmahr for your reviews! π₯
To-dos:
- Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You'll be made admin once you do.
- Fix all links to the GitHub repo to point to the repo under the ropensci organization.
- Delete your current code of conduct file if you had one since rOpenSci's default one will apply, see https://devguide.ropensci.org/collaboration.html#coc-file
- If you already had a
pkgdown
website and are ok relying only on rOpenSci central docs building and branding,- deactivate the automatic deployment you might have set up
- remove styling tweaks from your pkgdown config but keep that config file
- replace the whole current
pkgdown
website with a redirecting page - replace your package docs URL with
https://docs.ropensci.org/package_name
- In addition, in your DESCRIPTION file, include the docs link in the
URL
field alongside the link to the GitHub repository, e.g.:URL: https://docs.ropensci.org/foobar (website) https://github.com/ropensci/foobar
- Fix any links in badges for CI and coverage to point to the ropensci URL. We no longer transfer Appveyor projects to ropensci Appveyor account so after transfer of your repo to rOpenSci's "ropensci" GitHub organization the badge should be
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname)
. If Appveyor does not pick up new commits after transfer, you might need to delete and re-create the Appveyor project. (Repo transfers are smoother with Travis CI and GitHub Actions) - We're starting to roll out software metadata files to all ropensci packages via the Codemeta initiative, see https://github.com/ropensci/codemetar/#codemetar for how to include it in your package, after installing the package - should be easy as running codemetar::write_codemeta() in the root of your package.
DONE
- Create a two-person team in rOpenSciβs βropensciβ GitHub organization, named for the package, with yourself and the package author as members.
- Go to the repository settings in rOpenSciβs βropensciβ GitHub organization and give the author βAdminβ access to the repository.
- If authors maintain a gitbook that is at least partly about their package, contact an rOpenSci staff member so they might contact the authors about transfer to the ropensci-books GitHub organisation.
**
NOMINATION
- Nominate a package to be featured in an rOpenSci blog post or tech note if you think it might be of high interest. Please note in the software review issue one or two things the author could highlight, and tag @ropensci/blog-editors
@ropensci/blog-editors, I think these packages might be of high interest. I suspect many readers will be familiar with the drake package and may wonder how it compares to targets and how to migrate; they may also be interested in the relationship with tarchetypes.
Agreed. @wlandau we must have a post about this π. We're lucky to have this in the rOpenSci organization.
Guidelines: https://blogguide.ropensci.org/. When you're ready Will, please suggest a date to submit a draft post and my colleague @steffilazerte will review it.
Thanks, I would love to submit a post! I did write a draft, but much has happened since then, and it needs a lot of work. I will submit a PR when it is ready.
My apologies @tjmahr and @limnoliver, somehow I managed to miss this from your reviews:
[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.
I will add you both.
"rev"
roles now added.
@wlandau I've sent you an invite to the ropensci-books organization so you might transfer your book repo there. Thank you!
Thanks so much, @maelle! As I discussed with @maurolepore, I transferred wlandau/targets-manual
and wlandau/targets-design
to ropensci-books
.
Just one last thing: would you grant me access to the settings of both ropensci-books/targets-manual
and ropensci-books/targets-design
? I would like to rename ropensci-books/targets-manual
to the more concise ropensci-books/targets
.
Amazing, thank you! And the URLs are working!
@maurolepore, I believe I addressed the items in #401 (comment) (except the URL redirects, which are currently not working for the ropensci-books
repos). Please let me know if there is anything I missed.
I just submitted targets
to CRAN, and I will submit tarchetypes
as soon as targets
is accepted and the binaries are available. My colleagues and I look forward to the upcoming cascade of internal production releases that will soon follow.
In the near future, I would like to post an rOpenSci submission for jagstargets
, an R Targetopia package for Bayesian data analysis with JAGS. jagstargets
should be a nice warmup for stantargets
, a similar but larger package built on targets
and Stan.
This review process was incredibly rewarding, and targets
is much more accessible as a result. I cannot thank you enough!
@maurolepore, I believe I addressed the items in #401 (comment) (except the URL redirects, which are currently not working for the ropensci-books repos).
Thanks! I'm sure you'll keep on top of it. Let me know on Slack if you need help.
Following https://devguide.ropensci.org/editorguide.html I added a "peer-reviewed" topic to targets and tarchetypes, and I'm now closing this software-review issue.
Thank you to our amazing reviewers @limnoliver and @tjmahr and amazing author @wlandau. I look forward to seeing the R community thrive with these new packages. Best luck with the submission to CRAN.
Fantastic, @maurolepore! Thank you so much for your conscientiousness and diligence during this whole process.
An update: the redirects and 404 pages are now working. The remaining issues are purely cosmetic and probably have to do with how R Markdown sites get rendered: https://community.rstudio.com/t/trouble-customizing-404-page-of-r-markdown-website/93142.