ropensci/unconf15

The R package as the unit of reproducible research

gaborcsardi opened this issue · 15 comments

In my opinion a piece of reproducible research needs at least the following ingredients:

  1. Documents, to write down goals, reasons for decisions, subproject summaries, why some directions were abandoned, etc. Research papers belong here, too.
  2. Code, to do the computation.
  3. Potentially, documentation for the code.
  4. Tests for your code. (Well, ideally.)
  5. Data.
  6. Code or data by other people (or code/data from your previous projects), plus a way to specify what that code/data is and which versions you need.
  7. Scripts that use your own or external code and generate data or documents.
  8. Other misc files, maybe.
  9. A way of documenting the environment external to the package, to make the whole thing reproducible. (I.e. OS version, system libraries, etc.)
  10. A way to keep track of all the things above, how and when they change over time.
  11. A way of sharing all the things above with collaborators.

If you think about it, this is more or less the description of an R package in a git repository:

  1. Use vignettes for documents, put them in /vignettes
  2. Just put the R/C/C++ code in the package. Other code is trickier, but you can put it in /inst.
  3. Write roxygen or plain Rd docs; they go into /man.
  4. Use testthat (or another testing package, or R's built-in package testing mechanism); tests go into /tests.
  5. Just put it in /data. (OK, not that simple, because data can be big, so maybe into another data package, or a database.)
  6. If these other things are also R packages, then just declare your Imports on them in DESCRIPTION.
  7. Again, /vignettes.
  8. They go in /inst.
  9. Use Docker if you can. If you can't (e.g. I can't use Docker to run my stuff on the university cluster), then declare your R version and SystemRequirements in DESCRIPTION.
  10. Put the package into a git repository.
  11. Put the git repository on Github, Bitbucket, etc.
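To make the mapping concrete, here is a minimal DESCRIPTION sketch tying several of the points above together (the package name, dependencies, and field values are all hypothetical; SystemRequirements is the standard free-text field for declaring external system dependencies):

```
Package: myanalysis
Title: Code, Data and Documents for a Hypothetical Study
Version: 0.1.0
Authors@R: person("Jane", "Doe", role = c("aut", "cre"))
Description: A reproducible analysis packaged as an R package.
Depends: R (>= 3.1.0)
Imports: dplyr, ggplot2
Suggests: testthat, knitr
VignetteBuilder: knitr
SystemRequirements: GNU make, libxml2
License: MIT + file LICENSE
```

Imports covers point 6 for dependencies that are themselves packages, Suggests/VignetteBuilder cover points 4 and 7, and SystemRequirements is a (free-text, so imperfect) stab at point 9.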

So do we have everything to use R packages as research units? Probably not, but we are really close, I think. In my opinion we would need:

  • A better way of handling (versioned) dependencies, for R packages not on CRAN as well (see #7).
  • A better way of describing the platform, so that it is possible to recreate it without much effort. Docker is great for this, but sometimes you just cannot use Docker.
  • Tools that facilitate the whole process. devtools is great for developing packages in general. Some more specialized tools that go further towards this particular use of packages would be nice.

If this does make sense to you (or the opposite :), I'll be happy to chat about it more.

Great issue and well summarized! Just to note, this is clearly very closely tied to #11 and probably #6 as well. #11 includes a list of challenges various of us have encountered when trying this.

Having tried to practice this for the past five years, I find some of the biggest challenges are as much conceptual as infrastructural. This only gets more difficult when new work builds on existing work, or when one investigation branches into separate ones. When is an idea/line of investigation ready to be a package? When should I start a new package vs. continue with an existing one? Should I branch an existing repo to explore a new direction? Is 1 package : 1 paper the ideal? I have plenty of examples of these variations in my own Github account and would love to chat through this decision tree on some concrete examples if anyone is interested.

Meanwhile, the tooling has definitely gotten better. Just a quick comment on your 5, "the data is too big": use devtools::use_data_raw() to add scripts that pull the data from an appropriate data repository and tidy it up. Even though the notion of a top-level data-raw dir does not fit the R package definition, it's great to have something like devtools both recognize the need for this and, more importantly, provide something of a standard convention. Until now I have had to come up with custom solutions for things like this, which benefit from neither the common tooling nor being a recognizable pattern to anyone but me. So cheers to @hadley et al. Maybe we can identify other places where something similar is needed.
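For illustration, a minimal sketch of such a data-raw script (the URL, file name, and dataset name are all hypothetical):

```r
# data-raw/mydata.R -- fetch raw data from an external repository,
# tidy it, and save the cleaned version into the package's data/ dir.
raw <- read.csv("https://example.org/archive/mydata.csv")  # hypothetical source
mydata <- raw[complete.cases(raw), ]                       # minimal tidying step
devtools::use_data(mydata, overwrite = TRUE)               # writes data/mydata.rda
```

The script itself is versioned with the package, so anyone can see exactly how data/mydata.rda was produced.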

> If this does make sense to you (or the opposite :), I'll be happy to chat about it more.

Yes, I'd love to set some time aside to chat through this.

I will confess to being skeptical that an R package is the natural unit of reproducible research. I'm talking about packaging and documenting a specific data analysis, such as for a publication. The main goal of a package is to provide functions for reuse in diverse contexts. The main goal of an analysis is to turn a set of inputs into a set of outputs. As far as re-purposing existing tools goes, I find make more useful than R CMD build. (And I find them both rather awkward for this task!) I totally agree this is an interesting and worthwhile discussion.

Re: @gaborcsardi's proposal … we'd have to really give vignettes more love in the workflow/tools.

@jennybc I agree that it is not natural, but maybe with some tools we can make it (more) natural. This is exactly what I wanted to discuss. A good R-based build system is such a tool, for example. Something along the lines of https://github.com/richfitz/remake or the Grunt JavaScript project.
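To give a flavor of what an R-based build system looks like, here is a hedged sketch of a remake-style configuration file; the target names and the helper functions process_data() and make_plot() are hypothetical, and the exact schema should be checked against the remake docs:

```yaml
sources:
  - R/analysis.R          # would define process_data() and make_plot()

targets:
  all:
    depends: figures/plot.pdf

  data/processed.csv:
    command: process_data("data/raw.csv")

  figures/plot.pdf:
    command: make_plot("data/processed.csv")
    plot: true
```

Like make, only targets whose dependencies have changed get rebuilt, but the commands are ordinary R functions.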

My premise is that R packages do provide a lot of things that you need for a "reproducible research project". Let's discuss what is missing, to see if it is reasonable to create it.

@cboettig I completely agree that the conceptual challenges are at least as big. These will never go away entirely, but good tools can help direct researchers towards good practices.

I do love the idea of an R-based build system and the Make-like aspects of remake are very cool. But it also illustrates how working within the R package system encourages (forces?) you to "function all the things". You will have to pry my scripts from my cold dead hands. 😁

I should learn something about Grunt ….

Thanks for this discussion @gaborcsardi

One missing piece here is automating documentation for data. @cboettig's EML package is a good example of a tool that can help automate dataset curation in R. Metadata for data is the kind of thing most people forget about, but it makes data reuse so much easier, and at the very least EML can create granular, machine-readable metadata files that make dataset submission to online repos easy.

Agree with @sckott, but also, if the data are too large and potentially dynamic, pointing to DOI'd snapshots (à la DataONE?) is an important thing to consider too.

@jread-usgs In the future we should also be able to just do this with Dat. Such that the package contains a dat remote and appropriate metadata. Then a user can just dat clone when retrieving the raw data. This is planned functionality for rDat.

@karthik excited for that to be a reality. cool.

All good ideas. Quick comments:

  • I like dat very much, and of course we can have "plugins" (small R packages, really) that add various data sources, like DOIs.
  • If the data is small but dynamic, that's fine: put it in the R package (or in a separate R package by itself, and put that R package into git). So this could be a plugin, too. Obviously less functional than dat, but sometimes this is enough.
  • You can document data in an R package. I understand that this is not always flexible enough, in which case you can just use an EML(-like) plugin.
  • I like scripts, and definitely want to keep them in the workflow. :)

This topic/thread has sparked an interesting discussion in our office, and I wanted to bring up a point that @lawinslow made that I had missed: often the point of a package is to abstract away elements of the data processing, which is in direct tension with the goals of reproducible research. Of course the guts would be available in the source, but the points of emphasis may be at odds between the two concepts. The scripts-vs-functions discussion that @jennybc brought up also supports this point. Just another thing to consider as part of the high-level discussion of "R packages as reproducible research".

@jread-usgs I was probably unclear about a lot of things. I didn't mean to say that all R packages are research.

I agree that packages are abstractions; in a sense, all of programming is about making abstractions. But for the purposes of the research, at any given stage of the project (i.e. at a given git commit) these abstractions correspond to specific implementations. (At least if you want to execute them, you need implementations.) And that is all you need to make the research reproducible.

As for scripts, I think they are fine. We can just put them in inst/scripts and then have our tools handle them appropriately. E.g. the "make system" should be able to run scripts, etc.
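A sketch of how a tool (or a user) might locate and run such a packaged script, assuming a hypothetical installed package myanalysis with scripts shipped under inst/scripts/:

```r
# After installation, files under inst/scripts/ are available via system.file()
# (the inst/ prefix is dropped in the installed package).
path <- system.file("scripts", "01-fit-model.R", package = "myanalysis")
if (nzchar(path)) {
  source(path)  # a "make system" could do this for each script target
}
```

This keeps scripts as plain scripts while still letting the package machinery version and distribute them.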

Hi all. I've just created the repository we discussed yesterday. The README can host our notes from yesterday and developing thoughts:

https://github.com/ropensci/rrrpkg

I'm about to take @hadley's notes and dump them in. I may also take a pass through, in case I can add anything. Please feel free to add more via PR or ask me if you want to be a collaborator.