ropensci/unconf15

Packages as research repositories/compendia

benmarwick opened this issue · 11 comments

At last year's rOpenSci event we worked on a short guide to reproducible research, under @iamciera's guidance. Some of the most interesting progress on this topic since then has been on using the R package framework as a research repository or compendium for scholarly work, cf. @rmflight's blog posts, @cboettig's template package, @Pakillo's template package and @jhollist's manuscriptPackage, etc.

The concept of a research compendium has been around for a while (cf. Gentleman 2005, Gentleman & Temple Lang 2007, Stodden 2009, Leisch et al. 2011). Many of us are making custom R packages to accompany our research publications to improve reproducibility, but I think there are a bunch of open questions about the best ways to do this.

Perhaps at the unconf we can have a discussion to share some of the ways we're using R packages as research compendia, and draft a few guidelines to add to the guide. The goal would be to help domain scientists, especially those who are not primarily tool-developers or already prolific package authors, get started with this. @hadley's book is of course an excellent resource on R packages generally, but using packages as research compendia raises some specialised questions that this rOpenSci group is uniquely qualified to tackle.

Some of the questions that I'd like to learn more about on this topic include:

  • How best to include data in the package? Or how to link to data when it's too big to go in the package? Rdata files may be more efficient, but plain-text formats are more accessible for reuse in other contexts.
  • How best to include the manuscript in the package? The package vignette seems like the obvious choice, but there are some limitations to that, for example, I cannot store the HTML output from the rendered Rmd in there. @cboettig's solution is to have a manuscript directory in the package, which is outside of the regular package framework and needs make to execute.
  • How best to control dependencies on other packages? Should we specify exact versions? Bundle the source of other packages with our package to maximize isolation and protect against changes in other packages that will break ours? Which of the numerous current potential solutions to this problem has the most promise (packrat, rbundler, checkpoint, gRAN, drat, etc.)? Perhaps these questions are a subset of #7
  • How to address dependencies external to R when presenting a package as a stand-alone research repository? A docker image containing the package is one option some of us have pursued (cf. cboettig/nonparametric-bayes#55)
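One way to weigh the first question (binary vs. plain-text data) is to ship both forms side by side. A minimal base-R sketch; the names (`mydata`, `mypkg`) and paths are illustrative, not a fixed convention:

```r
# Two ways to ship a small dataset with a compendium package.
pkg_dir <- file.path(tempdir(), "mypkg")  # stand-in for the package source tree

mydata <- data.frame(site = c("A", "B"), count = c(10L, 42L))

# 1. Binary .rda under data/: compact and lazy-loadable via data(mydata),
#    but opaque to anyone working outside R.
dir.create(file.path(pkg_dir, "data"), recursive = TRUE, showWarnings = FALSE)
save(mydata, file = file.path(pkg_dir, "data", "mydata.rda"))

# 2. Plain text under inst/extdata/: bigger on disk, but reusable anywhere;
#    after installation it would be found with
#    system.file("extdata", "mydata.csv", package = "mypkg").
dir.create(file.path(pkg_dir, "inst", "extdata"), recursive = TRUE,
           showWarnings = FALSE)
write.csv(mydata, file.path(pkg_dir, "inst", "extdata", "mydata.csv"),
          row.names = FALSE)

# The CSV round-trips losslessly for simple rectangular data:
reread <- read.csv(file.path(pkg_dir, "inst", "extdata", "mydata.csv"))
stopifnot(identical(reread$count, mydata$count))
```

For data too big to ship either way, the csv/rda pair can be replaced by a small download script plus a checksum, but that is a separate design question.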

Not surprisingly I'm also interested in this. To add to the list:

  • When is a simple R script / Rmd sufficient, and when (if ever?) should we be using an R package for this purpose (vs., say, creating the R package independently if there's some complicated set of functions to be implemented, e.g. as pomp is to this recent paper and its R script supplement: http://doi.org/10.1073/pnas.1410597112)?
  • If we just go the R-script route, what's a quick and easy way to deal with the kind of metadata / dependency / data / code documentation stuff we get built in when we use R packages?
  • What about the infrastructure beyond the package structure itself? e.g. archiving via Zenodo/figshare, use of Github, use of Travis-CI for continuous integration of a research paper, deploying artifacts from CI, simple metadata for scripts (zenodo.json, https://github.com/mbjones/codemeta maybe)
  • What's the role (if any) of make in this picture? maker? (Trying to draw @richfitz in here to set me right)
  • Do we really want manuscripts to be .Rmd files instead of R scripts? Scientific papers don't always flow like a vignette, with blocks of text followed by the code that implements whatever was just described, so it's not just a matter of include=FALSE (e.g. methods sections may come after figures).
  • Handling all those pesky components of a manuscript: bib files, external chunks, tex templates, csl and cls files, possibly the packrat files, output pdf and tex formats, etc. This is another reason I haven't found it practical to treat a real manuscript as a vignette. Perhaps this is already solved, but the answer isn't obvious to me. Certainly rmarkdown, rticles, etc. have made it a bit easier, but this still all tends to look a lot cleaner in toy examples than in my real life.
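On the quick-and-easy metadata question above: one lightweight stopgap for the plain-script route (not a substitute for a real DESCRIPTION file) is to record the session state next to the script's outputs. A base-R sketch; the file name is just a convention:

```r
# Record the R version, platform, and attached package versions alongside a
# script's outputs -- a poor man's dependency manifest for plain scripts.
info_file <- file.path(tempdir(), "sessionInfo.txt")
writeLines(capture.output(sessionInfo()), info_file)

# The first line records the R version the script actually ran under:
readLines(info_file)[1]
```

Checking this file into the repository next to the script at least tells a future reader which package versions produced the published results.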

๐Ÿ‘ seems like a good idea, possibly also linking with #6 as (a) what is a manuscript if not an artefact of research? and (b) how do you store outputs with the compendium?

I won't be at that unconf (will be following along remotely), but wanted to add my support to the idea of discussing packages as research compendium. Couple of observations:

  • Output is an issue that I have found to be tricky. PDFs formatted by LaTeX look great, but they're not always practical, especially for co-authors, reviewers, etc. I haven't found a really good (i.e. easy) way to have the output formatted in multiple ways (PDF and Word) for different audiences. This is a usability question more than a "what makes up a compendium" question, but it's enough of a headache in my experience that I think it warrants some discussion.
  • I have only done a single paper (still in internal review) as an R package/vignette, but the issue @cboettig raises regarding the flow of a paper vs. a vignette is real. The main problem I had was with the abstract (I moved all figures and tables to the end): I wanted to refer to results in the abstract that weren't discussed until later in the paper. I ended up moving most of the analysis into a few chunks directly after the YAML and before the abstract in the .Rmd.
  • Dealing with analyses that take time to run and/or have sizable output also caused some problems. The paper we have now that uses the manuscriptPackage template includes random forest modelling. The analysis takes a couple of days to run and the cached results are biggish. As such, I don't want that analysis to run every time the vignette has to build, but the cached objects take time to download and install, so some workarounds are required. What we have is modest, and with more sizable projects with significant runtime this would break the package-as-compendium idea, I'd think. In short, any solution would have to work (almost) equally well for small/quick projects as it does for big/slow ones.
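One common pattern for the slow-analysis problem described above is a cache guard inside the vignette chunk, so the expensive fit runs once and later builds just reload the saved object. A sketch, with a fast `lm()` call standing in for the days-long model:

```r
# Run the expensive step only when no cached result exists; subsequent
# vignette builds read the saved object instead of refitting.
cache_file <- file.path(tempdir(), "model_fit.rds")  # illustrative path

if (file.exists(cache_file)) {
  fit <- readRDS(cache_file)
} else {
  # Stand-in for a slow fit (e.g. a random forest that runs for days):
  fit <- lm(mpg ~ wt, data = mtcars)
  saveRDS(fit, cache_file)
}

# Downstream code uses the fit identically on either path:
coef(fit)["wt"]
```

This doesn't solve the "cached objects are big to distribute" half of the problem, but it does keep repeated vignette builds cheap on both small and large projects.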

I may sound like I am down on the idea of using packages, but I am not. I have found a lot of benefit in using the package format and specifically using the vignette as manuscript. I will use the model again and anything that comes out of this discussion would be great to include for my next manuscript.

I think there is some good discussion to be had as to whether the goal of the reproducibility charge is the end-to-end publication target (including the issues pointed out above with citation management) or the generation of publication components that are data/code/methods related. This is a topic that I am very interested in. Some of the things we have been working on are geared towards including R packaging (or something like it) in larger collaborations, as both the analytical tools and the building of product components (figures/tables) as part of project-level CI. I think there is much to be done in terms of steering the research process towards reproducibility, and it is going to become more important as data and questions increase in complexity and teams continue to grow and diversify.

@cboettig the dynamic-documents-vs-scientific-narratives question is a tough one. I did some work on non-linear dynamic documents for my thesis, where a narrative is a path through the graph of document elements. See, e.g., https://github.com/gmbecker/DynDocModel (hoping to find time to bump this back up to the /back/ burner ...). Things get very complicated very quickly, though.

Something akin to the Vistrails approach http://www.vistrails.org/index.php/Main_Page#Publishing_Reproducible_Results , with a database of code and artifacts that a dynamic/"live" paper pulls from/recomputes at compile or view time might be more useful in practice. At least in the short term.

A modification of Gavish and Donoho's proposed VCRs http://www.sciencedirect.com/science/article/pii/S1877050911001256 is another possibility, though AFAIR they call only for verification, not dynamic reproduction.

Have added the method bundle_repo to git2r, which might be useful in this context. It clones the package repository as a bare repo to inst/pkg.git, so that when the package is installed the repo can be accessed with repo <- repository(system.file("pkg.git", package = "pkg")). I'm also planning to add an argument session (TRUE/FALSE) to the commit and tag methods to append the sessionInfo() output to the commit/tag message.

One suggestion for tracking provenance from @metamattj is the recordr package https://github.com/NCEAS/recordr/

To follow up a bit on this, one of the outcomes of the 2015 unconf discussion was this essay:

https://github.com/ropensci/rrrpkg

And we expanded that into this pre-print:

https://peerj.com/preprints/3192/

Which will shortly appear in The American Statistician in a collection of papers on 'Practical Data Science for Stats'

Awesome! And thanks for posting the follow up here.

And just to add an idea for more work :) would you be interested in a blog post on this to cross-post on the rOpenSci and Software/Data Carpentry blogs, or just post on one? I imagine @stefaniebutland on the rOpenSci side and @weaverbel on SWC/DC could help.

@tracykteal Is this post on an unconf17 project relevant here?
Tackling the Research Compendium at runconf17 https://ropensci.org/blog/blog/2017/06/20/checkers