/2019-05-14_reproducibility

🏫 Materials for talk/demonstration on reproducibility in R, given May 14, 2019 at the Utah County RUG


“Why won’t this run again?!”: Making R analysis more reproducible

Andrew Heiss, PhD • Brigham Young University
May 14, 2019 • Utah County R Users Group



Slides

Download the slides from today’s talk

General resources for reproducibility

Example methods

Examples of each of these are included in this repository. Click on the big green “Clone or download” button at the top of the GitHub page and download the .zip file to follow along.

Code only

All code is included in a single, fairly well-commented file. Data is not included; it must either be downloaded separately or obtained from the authors.

Real life examples:

  • Most R code out in the wild :)

How to use:

  • Download and open analysis.R
  • Try to run it
  • Install missing packages as needed
  • Hope packages are the correct version
  • Track down data if/when missing
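
The "install missing packages" and "hope packages are the correct version" steps can be made a little less painful with a quick check before running the script. This is only a sketch in base R: the package names in `needed` are placeholders, not the actual dependencies of analysis.R.

```r
# Hypothetical list of packages the script depends on (placeholder names)
needed <- c("stats", "utils", "tidyverse")

# Which of those are not installed in the current library paths?
installed <- rownames(installed.packages())
missing_pkgs <- setdiff(needed, installed)

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install them. This fetches the *current* CRAN version,
  # which may differ from the version the authors used:
  # install.packages(missing_pkgs)
}

# Record the versions of the packages that are installed, so someone else
# can compare them against their own setup
have <- intersect(needed, installed)
versions <- vapply(have, function(p) as.character(packageVersion(p)), character(1))
print(versions)
```

Printing the versions does not fix a version mismatch, but it at least makes the mismatch visible.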

Code and data

All code is included in multiple well-commented files. Data is included in data/. Folder is structured as an RStudio project.

Real life examples:

How to use:

  • Download directory
  • Open 02_code-data.Rproj to open a new RStudio instance
  • Run process-data.R, then figure-1.R, then models_figure-2.R (following the instructions in the project’s README file)
  • Install missing packages as needed (and hope they’re the right version)
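
If you would rather not run the three scripts by hand, a small runner script can enforce the order. The sketch below takes the file names from the steps above; that they sit in the project root is an assumption, and `run_in_order()` is a hypothetical helper, not something shipped with the example.

```r
# Script names come from the steps above; their location relative to the
# project root is an assumption
scripts <- c("process-data.R", "figure-1.R", "models_figure-2.R")

# Hypothetical helper: source each script in turn and stop at the first
# one that cannot be found
run_in_order <- function(scripts) {
  ran <- character(0)
  for (s in scripts) {
    if (!file.exists(s)) {
      stop("Cannot find ", s, "; is the working directory the project root?")
    }
    message("Running ", s)
    source(s)
    ran <- c(ran, s)
  }
  invisible(ran)
}

# From the project root:
# run_in_order(scripts)
```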

Code and data + Makefile

Same as the previous example, except now everything is automated with a Makefile that runs the three R scripts in the correct order.

Real life examples:

Writing Makefiles goes beyond the scope of this little demonstration, but Karl Broman has some excellent resources and tutorials about how to use them.

How to use:

  • Make sure you have access to GNU make. On macOS, open Terminal (found in /Applications/Utilities/) and run xcode-select --install. If you use Windows, check out this Stat 545 page.
  • Download directory
  • Open 03_code-data_makefile.Rproj to open a new RStudio instance
  • Open the terminal panel in RStudio and type make output (go to Tools > Terminal > New terminal if you don’t have a terminal panel available already)
  • Install missing packages as needed (and hope they’re the right version)

R Markdown report

Here we use a single R Markdown file to conduct the analysis. This literate programming approach lets you mix prose and code and creates a notebook for your analysis.

Real life examples:

How to use:

  • Download directory
  • Open 04_Rmd-report.Rproj to open a new RStudio instance
  • Open provo-weather.Rmd and click on the “Knit” button near the top of the source editor; wait for R to generate an HTML file
  • Install missing packages as needed (and hope they’re the right version)
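
Clicking "Knit" is roughly equivalent to calling `rmarkdown::render()` yourself, which is handy in scripts. Here is a minimal sketch, assuming the rmarkdown package is installed and the working directory is the project root; `knit_report()` is a hypothetical wrapper name.

```r
# Hypothetical wrapper around rmarkdown::render() for use from the console
# or from another script
knit_report <- function(input = "provo-weather.Rmd") {
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("Install rmarkdown first: install.packages(\"rmarkdown\")")
  }
  if (!file.exists(input)) {
    stop("Cannot find ", input, "; is the working directory the project root?")
  }
  # Evaluate the document in a fresh environment so objects already in the
  # current session cannot leak into the report
  rmarkdown::render(input, output_format = "html_document", envir = new.env())
}

# From the project root:
# knit_report()
```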

R Markdown website

Here we use R Markdown’s built-in website capabilities to generate a static website from a collection of .Rmd files. This allows you to have a more complicated notebook with subpages that you can upload anywhere online (your own private server, GitHub pages, etc.), or keep locally on your computer.

See the R Markdown Websites documentation for complete details of this approach. Here’s the tl;dr version:

  • Click on the “Build website” button in the Build panel in RStudio to build the website
  • The generated site will be in _site/. Put this somewhere online if you want.
  • _site.yml controls what goes in the navigation bar and other site generation settings
  • index.Rmd is the home page (it is required)
  • R will knit all .Rmd files in the root directory in alphabetical order. To control the order in which they’re knit (e.g. if one depends on another), prefix the file names with numbers.
    • By default all the .Rmd files will share the same environment (i.e. if one file runs library(tidyverse), tidyverse functions will be available in the next file). If you don’t want this to happen (you don’t), make sure new_session: true is set in _site.yml, which makes each .Rmd use a clean environment.
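
To make these settings concrete, here is a sketch of a minimal _site.yml. It is written out from R only to keep the example self-contained; the file contents are what matter. The site name, navbar titles, and .Rmd file names are all placeholders.

```r
# Contents of a minimal _site.yml; the site name, titles, and file names
# below are placeholders
site_yml <- c(
  'name: "my-analysis"',
  'output_dir: "_site"',
  'new_session: true',
  'navbar:',
  '  title: "My analysis"',
  '  left:',
  '    - text: "Home"',
  '      href: index.html',
  '    - text: "Clean data"',
  '      href: 01_clean-data.html',
  '    - text: "Models"',
  '      href: 02_models.html'
)

# Write it into a scratch directory so nothing in a real project is overwritten
site_dir <- file.path(tempdir(), "example-site")
dir.create(site_dir, showWarnings = FALSE)
writeLines(site_yml, file.path(site_dir, "_site.yml"))

# Numbered prefixes make alphabetical order match the intended knitting order
rmd_files <- c("index.Rmd", "02_models.Rmd", "01_clean-data.Rmd")
sort(rmd_files)
```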

Real life examples:

  • The Power of Ranking: The Ease of Doing Business Indicator as a Form of Social Pressure (website; GitHub)
  • NGO Crackdowns and Philanthropy (website; GitHub)
  • Why Donors Donate (website; GitHub)
  • Are Donors Really Responding? Analyzing the Impact of Global Restrictions on NGOs (website; GitHub)

How to use:

  • Download directory
  • Open 05_Rmd-website.Rproj to open a new RStudio instance
  • Click on “Build website” in the “Build” panel
  • Navigate the preview that appears in RStudio or open _site/index.html in your browser
  • Install missing packages as needed (and hope they’re the right version)

rrtools

Ben Marwick’s rrtools package allows you to create a “research compendium,” or a self-contained R package that includes your analysis, data, R functions, and final paper that users can install with devtools::install() (or devtools::install_github() if you have your project hosted at GitHub).

Because the project is structured as a package, R will handle package dependencies for you automatically. You can also include your commonly used custom functions in the package, letting you use things like library(myreproducibleproject) or myreproducibleproject::custom_function() in your project.

Real life examples:

  • Are Donors Really Responding? Analyzing the Impact of Global Restrictions on NGOs (website; GitHub)
  • Why Donors Donate (website; GitHub)

To create your own compendium follow the instructions at the README. Here’s the tl;dr version:

  • Run library(rrtools)
  • Run create_compendium("nameofyourpackage")
  • Open the new RStudio project that rrtools created
  • Put your analysis in analysis/; put your data in analysis/data/; put your paper in analysis/paper/
  • Put custom functions in R/ and use roxygen2 to document them
  • Use library(nameofyourpackage) to access your custom functions
  • Build your package by clicking on “Install and Restart” in the Build panel
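
For the "custom functions in R/" step, a documented function might look like the sketch below. The function itself is a made-up example (it is not part of this repository); the `#'` comments are standard roxygen2 tags that get turned into help pages when the package is documented.

```r
#' Convert a temperature from Fahrenheit to Celsius
#'
#' @param temp_f Numeric vector of temperatures in degrees Fahrenheit.
#' @return Numeric vector of temperatures in degrees Celsius.
#' @export
#' @examples
#' fahrenheit_to_celsius(c(32, 212))
fahrenheit_to_celsius <- function(temp_f) {
  stopifnot(is.numeric(temp_f))
  (temp_f - 32) * 5 / 9
}
```

Saved as something like R/fahrenheit_to_celsius.R, this would be available after library(nameofyourpackage) once the package is rebuilt.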

In this example, I’ve put an R Markdown website in the analysis/ folder. Since this project is already a package, the Build panel in RStudio is configured to build a package, not a website. In order to build the website, you’ll need to run rmarkdown::render_site(). I’ve included this in a Makefile in analysis/, so you’ll need to open a terminal panel and type cd analysis, then make html to generate the site.

How to use this example:

  • Download directory
  • Open rrtools.Rproj to open a new RStudio instance
  • Run devtools::install(".", dependencies = TRUE) to install the package and all its dependencies
  • Click on the Terminal panel in RStudio and type cd analysis
  • Type make html
  • Open analysis/_site/index.html in your browser

renv

RStudio’s new (and still in-development) renv package lets you maintain a local project-specific library of packages, similar to Python’s virtualenv and pyenv. The README for renv and the introduction vignette explain how it all works and how to get started. Here’s the tl;dr version:

  • The renv.lock file contains a list of all the packages your project uses, with version numbers and hashes. Don’t edit this manually; renv has functions that generate and update this for you.
  • The renv/activate.R file is a script that tells R to use renv/library/* when you run library(blah).
  • renv/library/ contains a local package structure
  • .Rprofile has a new line in it that runs renv/activate.R when you start a new R session.
  • If you use version control (or if you’re distributing this project to others), you only need to track/include renv.lock, .Rprofile, and renv/activate.R. Don’t include the contents of renv/library/, since that is platform-specific and R will install packages there as needed.
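
The functions that generate and update these files are `renv::init()`, `renv::snapshot()`, and `renv::restore()`. The sketch below wraps them in a hypothetical helper, `renv_step()`, so that nothing runs until you call it; it assumes renv is already installed.

```r
# Hypothetical helper that dispatches to the three main renv functions
renv_step <- function(step = c("init", "snapshot", "restore"), project = ".") {
  step <- match.arg(step)
  if (!requireNamespace("renv", quietly = TRUE)) {
    stop("The renv package is not installed")
  }
  switch(
    step,
    # Set up renv/ and .Rprofile and write an initial renv.lock
    init = renv::init(project = project),
    # Record the package versions currently in use in renv.lock
    snapshot = renv::snapshot(project = project),
    # Install the versions listed in renv.lock into renv/library/
    restore = renv::restore(project = project)
  )
}

# Typical sequence:
# renv_step("init")      # once, when starting the project
# renv_step("snapshot")  # after adding or updating packages
# renv_step("restore")   # on another computer, after downloading the project
```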

How to use this example:

  • Download directory
  • Open 07_renv.Rproj to open a new RStudio instance
  • Install renv with devtools::install_github("rstudio/renv")
  • Restart your R session (to make .Rprofile run renv/activate.R)
  • Wait as all dependencies are installed automatically
  • Click on “Build website” in the “Build” panel
  • Navigate the preview that appears in RStudio or open _site/index.html in your browser

Docker

Docker allows you to create containers (which behave like lightweight virtual machines) and run stuff in them. Containers are essentially miniature Linux computers with different pieces of software pre-installed. They’re great for spinning up computers with exact versions of R and packages. You can access R within the containers through your browser: open a URL like http://localhost:8787 to get to an RStudio instance within the container.

Installing Docker and creating Dockerfiles goes beyond the scope of this little demonstration, but there are a ton of resources out there to get you started:

The main advantage of creating reproducible Docker containers is that it essentially lets other users download and install a complete standalone computer that is configured exactly how it was when you ran your code. It’s like the gold standard of reproducibility.

The Rocker team has made this even more gold standardy for R projects too. They maintain base Docker images for each R version (3.5.1, 3.6.0, etc.), and these images are set up to install packages from MRAN (Microsoft’s snapshot-based mirror of CRAN). This means that if you use R 3.6.0, any packages you install will be at the version they were when R was released.

Real life examples:

How to use this example:

  • Install Docker Desktop for Mac or Docker Desktop for Windows
  • Install Kitematic if you want a GUI for managing Docker containers (you do)
  • Download directory
  • Navigate to the directory in a terminal and type docker build -t myproject . to build all the required pieces
  • Wait while everything gets downloaded
  • Run docker run -e PASSWORD=blah -p 8787:8787 myproject to start the container
  • In your browser, go to http://localhost:8787. Log in using “rstudio” as the user name and “blah” as the password.
  • Open provo-weather.Rmd and knit it

Binder

Binder is a more user-friendly version of the Docker approach to reproducibility. Instead of requiring users to install Docker and build the container image locally, Binder handles all the hosting and provides access to a specific version of R and RStudio in a browser.

It also provides a simpler way to install and configure packages—there’s no need for complicated Dockerfiles (you can still use them, but they’re not recommended). You need two extra files for this to work:

  • runtime.txt, which contains the date for the MRAN snapshot that you want to use for package installation (formatted as r-YYYY-MM-DD)
  • install.R, which contains R code for installing packages
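
As an illustration of what goes in the two files, the sketch below builds their contents and writes them to a scratch directory. The snapshot date (the date of this talk) and the package names are placeholders; in a real project both files would live in the root of the repository.

```r
# Snapshot date for runtime.txt; the date of this talk is used as a placeholder
snapshot_date <- as.Date("2019-05-14")
runtime_txt <- paste0("r-", format(snapshot_date, "%Y-%m-%d"))

# Contents of install.R; the package names are placeholders
install_r <- 'install.packages(c("tidyverse", "rmarkdown"))'

# Write both files to a scratch directory so nothing real is overwritten
binder_dir <- file.path(tempdir(), "binder-example")
dir.create(binder_dir, showWarnings = FALSE)
writeLines(runtime_txt, file.path(binder_dir, "runtime.txt"))
writeLines(install_r, file.path(binder_dir, "install.R"))

# The single line that ends up in runtime.txt
runtime_txt
```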

Instructions and examples are here. Here’s the tl;dr version:

  • Make sure your project is in its own public repository either at GitHub or GitLab
  • Create a runtime.txt file and install.R file in the root of your project
  • Go to Binder, paste your repository’s URL into the form, and click on “Launch”
  • Wait for a looooong time (the Binder container will rebuild every time you commit to the repository, which will also take a long time; if there are no new commits, the container should open fairly quickly)
  • Binder will give you a URL when it’s done. If you open the URL as is, Binder will try to load your R files in a Jupyter notebook, which won’t work. Append ?urlpath=rstudio to the URL to open the project in an RStudio instance (e.g. https://mybinder.org/v2/gh/andrewheiss/binder-example/master?urlpath=rstudio)

That’s all!

How to use this example (this example actually lives in a separate GitHub repository so that it can work with Binder):