This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
This repository contains a template for a reproducible research project. The fundamental idea of reproducible research is that the steps that take your research from raw data to manuscript, thesis, or report should be fully automated. This way, your work can be checked by your adviser, mentors, collaborators, others working in your area, journal reviewers, and your future self.
I became interested in reproducible research because I was tired of being terrified of my own analysis. I was constantly petrified that someone would question my work and ask me to open the black box and verify that what I did was correct. In some cases, even being asked to reproduce a result was terrifying, because I knew the convoluted path of data preparation and cleaning I had taken to arrive at it.
I read the book Reproducible Research with R and RStudio by Christopher Gandrud (2013), along with many blog posts and tutorials by Karl Broman and Carl Boettiger, and I struck out on my own to execute a reproducible research project from start to finish. The repository for that project is here. While I succeeded in learning the basics of how R, R Markdown, knitr, and pandoc combine to make reproducible research possible, you can tell just by looking at the project's GitHub repository that I failed miserably at making the project reproducible. The repository is totally unorganized, and I am pretty sure I am the only one who could reproduce the results from this project.
But in that failure, I learned a lot about how a reproducible research project should be organized. I built this template for my future students and for my future self.
R and RStudio are an excellent vehicle for conducting reproducible research. You write manuscripts and reports in R Markdown (`.Rmd`) documents that include code chunks that perform the analysis. The code chunks are evaluated by R and incorporated into the document by the tools in the knitr package to produce a markdown (`.md`) document. From there, a program called pandoc converts your markdown document to whatever file format you like: PDF (formatted with LaTeX `.cls` class files), HTML, or Microsoft Word. This all happens without the user needing to know exactly what is going on, which makes it easy to get started.
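To make the two stages concrete, here is a minimal sketch that runs them by hand instead of through the Knit button. It assumes the knitr and rmarkdown packages and pandoc are installed; `knit_then_convert` is a hypothetical helper name, not something in this repository, and it assumes the target format name doubles as the file extension (for example `html` or `docx`).

```r
# Sketch: run the knitr stage and the pandoc stage separately.
# `knit_then_convert` is a hypothetical helper, not part of this repository.
knit_then_convert <- function(rmd_file, to = "html") {
  stopifnot(file.exists(rmd_file))

  # Stage 1: knitr evaluates the code chunks and writes a markdown file.
  md_file <- knitr::knit(
    input  = rmd_file,
    output = sub("\\.[Rr]md$", ".md", rmd_file),
    quiet  = TRUE
  )
  md_file <- normalizePath(md_file)

  # Stage 2: pandoc converts the markdown file to the requested format.
  out_name <- sub("\\.md$", paste0(".", to), basename(md_file))
  rmarkdown::pandoc_convert(
    input   = basename(md_file),
    to      = to,
    output  = out_name,
    options = "--standalone",
    wd      = dirname(md_file)
  )

  # Return the path of the converted file.
  file.path(dirname(md_file), out_name)
}

# Example call (not run here):
# knit_then_convert("manuscript-example.Rmd", to = "html")
```

In everyday use the Knit button, or a single call to `rmarkdown::render()`, handles both stages for you; the sketch only separates them to show what happens in between.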
First download the repository to your local machine. If you use Github, then this will mean cloning the repository into a new R project. If you are not a Github user, simply click 'Download Zip' and extract the file to a convenient location.
The repository contains several folders and files. They are organized to keep data preparation and cleaning in one folder, `data-raw`; analysis in another, `analysis`; and the outputs from the analysis that will become tables, figures, and numbers in the text of the manuscript in a third, `analysis-output`. The remaining files in the root directory relate to the manuscript itself. Next, we'll demonstrate how to link the raw data to the analysis, the analysis to the output, and the output to the manuscript, so that all the steps to generate the manuscript are automated and thus not subject to the inconsistencies that go along with piecewise data preparation and analysis.
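If you would rather start a new project from scratch than copy this repository, the same three-folder layout can be created with a few lines of base R. This is only a sketch: `create_skeleton` and the `my-project` folder name are hypothetical, and the demonstration writes to a temporary directory so that nothing on your machine is touched.

```r
# Sketch: create the three folders this template relies on.
# `create_skeleton` is a hypothetical helper name.
create_skeleton <- function(root) {
  folders <- c("data-raw", "analysis", "analysis-output")
  for (folder in folders) {
    dir.create(file.path(root, folder), recursive = TRUE, showWarnings = FALSE)
  }
  # Return the folder paths invisibly so callers can check them.
  invisible(file.path(root, folders))
}

# Demonstration in a temporary directory.
demo_root <- file.path(tempdir(), "my-project")
created <- create_skeleton(demo_root)
print(basename(created))
```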
Open the `manuscript-example.Rmd` and `tablesandfigures-examples.Rmd` files in RStudio.
Install the following packages, if they are not already installed:
install.packages("xtable")
install.packages("ggplot2")
install.packages("ggfortify")
install.packages("gridExtra")
install.packages("Quandl")
install.packages("RCurl")
install.packages("xts")
install.packages("urca")
install.packages("vars")
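Since some of these packages may already be installed, one option is to check first and install only what is missing. The sketch below uses base R only; `missing_packages` is a hypothetical helper name, and the install call is left commented out so that running the snippet never changes your library by surprise.

```r
# Sketch: find out which of the required packages are not yet installed.
# `missing_packages` is a hypothetical helper name.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("xtable", "ggplot2", "ggfortify", "gridExtra",
              "Quandl", "RCurl", "xts", "urca", "vars")
to_install <- missing_packages(required)

# Uncomment the next line to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```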
Click the 'Knit PDF' button on the code editing pane, and voila! A PDF of the manuscript should appear. In what follows we will walk through what is happening step by step.
Now we see the contents of the data and analysis files and how they come together.
The `data-raw` folder should contain either your raw data files (which will never, ever be modified) or a script that makes an API call, pulls the raw data from a shared server, etc. In this example, there is a script called `fetch-raw-data.R`, and its contents are shown below. This file fetches corn and soybean price data from quandl.com and puts them in data objects called `CZ2016` and `SX2016`. The data are returned as `xts` objects and then trimmed to the study period of interest.
# Filename: fetch-raw-data.R
# This file fetches the raw data and performs pre-processing (cleaning) to get it ready for analysis
library(RCurl)
library(xts)
library(Quandl)
Quandl.api_key("79SfoMaQc1npRAuq9ExZ")
# Define Dates of Analysis
start <- '2015-01-01'
today <- format(Sys.time(),"%Y-%m-%d")
# Fetch Corn and Soybean Prices
CZ2016 <- Quandl("CME/CZ2016", type = "xts")
SX2016 <- Quandl("CME/SX2016", type = "xts")
# Trim the dates
CZ2016 <- CZ2016[paste0(start,'/',today), 'Settle']
SX2016 <- SX2016[paste0(start,'/',today), 'Settle']
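Because the repository is public, you may prefer not to commit your own API key in the script. A common alternative, sketched here with base R only, is to read the key from an environment variable. The variable name `QUANDL_API_KEY` and the helper `read_api_key` are assumptions made for illustration; they are not part of this repository or of the Quandl package.

```r
# Sketch: read an API key from an environment variable instead of hardcoding it.
# `QUANDL_API_KEY` is an assumed variable name; use whatever name you like.
read_api_key <- function(var = "QUANDL_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}

# In fetch-raw-data.R one could then write (not run here):
# Quandl.api_key(read_api_key())
```

The variable can be set for your user account in a `.Renviron` file, which R reads at startup, so the key never appears in the repository.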
Of course, every data cleaning and preparation activity will be different, but this file should do all the preparation, so that the objects created by this script are ready to be used by the `analysis.R` script.
The contents of the `analysis.R` script in the `analysis` folder are below. The key is the line that says `source('data-raw/fetch-raw-data.R')`. This calls the `fetch-raw-data.R` script, so that when you run the code below, the raw data are fetched and prepared (from scratch each time you run the script). The rest of the `analysis.R` script then tests the corn and soybean prices for the presence of unit roots via the ADF test (Said and Dickey 1984), runs a Johansen cointegration test, and fits a VAR.
# Filename: analysis.R
# This file performs statistical analysis. It could be just one file, so it doesn't necessarily
# need its own folder, but sometimes your analysis may get complicated enough that you want
# to compartmentalize it. Separating different types of analyses into different scripts contained
# in the same folder can facilitate this.
library(urca)
library(vars)
# This line runs the source code that fetched your raw data and cleaned it. Now it is available
# for conducting analysis.
source('data-raw/fetch-raw-data.R')
# Store results of ADF tests for Corn and Soybeans in a list
adf <- list()
adf[[1]] <- ur.df(CZ2016, type = 'drift', lags = 5)
adf[[2]] <- ur.df(SX2016, type = 'drift', lags = 5)
# Store results of a Johansen cointegration test for Corn and Soybeans
jct <- ca.jo(cbind(CZ2016, SX2016), type = 'eigen', K = 5)
# Fit a VAR
lag_selection <- VARselect(cbind(CZ2016, SX2016), lag.max = 8)
var_model <- VAR(cbind(CZ2016, SX2016), p = 1, type = "const")
# Save these results so that they can be pulled into the manuscript without re-running the analysis.
save(adf, jct, lag_selection, var_model, file = 'analysis-output/results.rda')
The last line of the code snippet above says `save(adf, jct, lag_selection, var_model, file = 'analysis-output/results.rda')`. This saves the objects that contain the ADF, Johansen cointegration, lag selection, and VAR results to an `.rda` file called `results.rda` in the `analysis-output` folder. This 'R Data' file can be read by R, and the variable names `adf`, `jct`, `lag_selection`, and `var_model` are preserved when it is loaded later. We will load the `results.rda` file into the `tablesandfigures-example.Rmd` document to make the tables and figures in the manuscript.
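The whole chain (a data script is sourced by an analysis script, which saves results that a manuscript later loads) can be tried out in miniature without any of the packages above. The sketch below is self-contained base R that works in a temporary directory; the object names `toy_prices` and `toy_summary` and the numbers in them are made up purely for illustration.

```r
# Sketch of the source -> save -> load chain, using a temporary directory.
project <- file.path(tempdir(), "pipeline-sketch")
dir.create(file.path(project, "data-raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "analysis-output"), showWarnings = FALSE)

# Stand-in for data-raw/fetch-raw-data.R: creates one made-up data object.
raw_script <- file.path(project, "data-raw", "fetch-raw-data.R")
writeLines("toy_prices <- c(3.70, 3.75, 3.72, 3.80)", raw_script)

# Stand-in for analysis.R: source the data script, analyze, save the results.
source(raw_script)
toy_summary <- list(mean = mean(toy_prices), n = length(toy_prices))
results_file <- file.path(project, "analysis-output", "results.rda")
save(toy_summary, file = results_file)

# Stand-in for the manuscript: load the results into a fresh environment.
manuscript_env <- new.env()
loaded_names <- load(results_file, envir = manuscript_env)

# load() returns the names of the restored objects, unchanged.
print(loaded_names)
```

Loading into a separate environment here is only to show that the object comes back from the file under its original name rather than from the current session.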
At the top of the file named `manuscript-example.Rmd` you see a YAML header (YAML originally stood for Yet Another Markup Language). This header tells knitr and pandoc exactly what you want done with the document.
---
title: "A Very Serious Analysis of the Stationarity of Corn and Soybean Prices"
author: "Peter Economist, Paul Economist, Mary Economist"
date: 'May 06, 2016'
output:
  pdf_document:
    template: simple.latex
    fig_caption: yes
documentclass: ajae
bibliography: bibliography.bib
---
`title` and `author` are self-explanatory.
`date`: This field tells knitr to insert the current date, formatted in the `%B %d, %Y` style.
`output`: After knitr evaluates the code chunks contained in the body of the file, the `output` field tells pandoc what kind of file to create. Here we have asked for PDF output. Pandoc produces a PDF by first creating a `.tex` file; if no further fields are specified, it uses a default LaTeX template (based on the `article` class) to make the document. Here we have specified that the manuscript be formatted according to the specifications of the American Journal of Agricultural Economics (AJAE). Since the journal has its own LaTeX class (`ajae`) that comes in the standard LaTeX distribution, we can just specify `documentclass: ajae` and the formatting is handled. We also needed to specify `template: simple.latex` because something in the default pandoc template was clashing with the `ajae.cls` class file. I removed the problem lines and saved the result as `simple.latex`, which you can see in the root directory of this repository. We will cover how to specify different output formats in a later section.
`bibliography`: The file `bibliography.bib` is located in the root directory of this repository; it is a BibTeX database of all the references needed for the manuscript. Open this file and note what the reference entries look like. To build a database for your own paper, note that Google Scholar has a 'Cite' button below every search result it returns. Click 'Cite', then click 'BibTeX', and a plain-text window will open with the properly formatted BibTeX entry. Just copy and paste this into `bibliography.bib`.
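To see what the `%B %d, %Y` style mentioned for the `date` field produces, you can try the format string directly in R. The sketch below uses a fixed date so the result does not change from day to day. The commented line at the end shows one common way to compute the date at knit time with inline R; treat it as an assumption about how such a header can be written, not as a quote from this repository's file.

```r
# Apply the date format from the YAML header description to a fixed date.
# Month names follow your locale; in an English locale this gives "May 06, 2016".
header_date <- format(as.Date("2016-05-06"), "%B %d, %Y")
print(header_date)

# A header can compute the current date at knit time with inline R, e.g.:
# date: '`r format(Sys.time(), "%B %d, %Y")`'
```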
Formatting a document with Markdown is very easy, and there are many resources for learning the basics. Start with http://rmarkdown.rstudio.com/index.html and explore.
Main points:
# This is a Level 1 Header
## This is a Level 2 Header
`This is a citation of Akerlof's Lemons paper [@akerlof1970vthe].` renders as:
This is a citation of Akerlof's Lemons paper (Akerlof 1970).
This is an example of a code chunk in the manuscript document. The opening line tells knitr that what follows is a code chunk to be evaluated.
```{r, echo=FALSE, warning = FALSE, message = FALSE, results = "asis"}
t <- list()
t[[1]] <- xtable(adf[[1]]@testreg, caption = "ADF Results for Corn")
t[[2]] <- xtable(adf[[2]]@testreg, caption = "ADF Results for Soybeans")
print.xtable(t[[1]], caption.placement = 'top', comment = FALSE)
```
In the opening code chunk, we load the results from the `analysis-output` folder and fetch the raw data, which we will plot in a later code chunk. We also load all the libraries that will be used by later code chunks.
```{r, warning = FALSE, message = FALSE, echo=FALSE}
library(xtable)
library(ggplot2)
library(ggfortify)
library(gridExtra)
source('data-raw/fetch-raw-data.R')
# File name must match the one written by save() in analysis.R (case-sensitive on some systems)
load('analysis-output/results.rda')
```
Getting the pandoc default LaTeX template and the LaTeX style you prefer to work together can be a little tricky. In the root directory of this repository there is a file called `style-headers.md`. It contains a few complete YAML headers that should work if you copy and paste them in place of the current manuscript's YAML header.
Many of us have colleagues who expect to receive, and to be welcome to edit, Microsoft Word documents. Fortunately, reproducibility can be maintained. With the `manuscript-example.Rmd` file open, notice that the 'Knit PDF' button is actually a drop-down menu and 'Knit Word' is an option. If you click it, you will get a Microsoft Word document that you can deliver to your colleague or professor.
Word documents can be formatted with a `.docx` template. See the 'Style Reference' description on this page. Using the template will keep you from having to reformat the whole document every time you send an update to your colleagues and professors.
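The same Word output can be produced from the R console, which is handy once the template is in place. The sketch below assumes the rmarkdown package is installed; `render_word` is a hypothetical helper and `word-template.docx` is a hypothetical file name standing in for your own styled template. The `reference_docx` argument of `rmarkdown::word_document()` is what points pandoc at the template.

```r
# Sketch: render the manuscript to Word using a styled .docx template.
# `render_word` and `word-template.docx` are hypothetical names.
render_word <- function(rmd_file = "manuscript-example.Rmd",
                        template = "word-template.docx") {
  stopifnot(file.exists(rmd_file), file.exists(template))
  rmarkdown::render(
    input = rmd_file,
    output_format = rmarkdown::word_document(
      reference_docx = normalizePath(template)
    )
  )
}

# Example call (not run here):
# render_word("manuscript-example.Rmd", "word-template.docx")
```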
You might have noticed that there is also a `tablesandfigures.Rmd` file in the root directory. This is for users who need to produce Word documents. I have found no clean way to produce decently formatted tables and figures in Word using this method. I recommend keeping tables and figures in a separate document that you always render as a PDF, and keeping the manuscript text in its own file.
Equations are still a problem. Pandoc can interpret math symbols surrounded by `$`; for example, `$\exp^{i \pi} = -1$` will be rendered as $\exp^{i \pi} = -1$. However, these equations are not automatically numbered. To get automatically numbered equations that can be cross-referenced, they must be produced with pure LaTeX code. For example,
`\begin{equation} \Delta y_t = \alpha + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \dots + \delta_{p-1} \Delta y_{t-p+1} + \epsilon_t \end{equation}`
is rendered as
$$\Delta y_t = \alpha + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \dots + \delta_{p-1} \Delta y_{t-p+1} + \epsilon_t$$
The trouble is that pandoc ignores raw LaTeX when producing Word documents, so when you knit the Word document after writing your equations in pure LaTeX, they will be missing from the Word document. This means you will have to replace them in the Word document one way or another. There is a reasonable workaround. IguanaTex is a Microsoft PowerPoint add-in that takes LaTeX equations and returns copy-and-pasteable figures of typeset equations. I recommend creating one new slide for each equation in your document, then using IguanaTex to obtain figures of your equations that can be pasted into the Word document.
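For PDF output, cross-referencing a numbered equation takes a label inside the `equation` environment and a `\ref` in the text. The following is a minimal sketch; the label name `eq:adf` is an arbitrary choice made for illustration.

```latex
\begin{equation} \label{eq:adf}
\Delta y_t = \alpha + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \dots
  + \delta_{p-1} \Delta y_{t-p+1} + \epsilon_t
\end{equation}

The null hypothesis of a unit root in equation (\ref{eq:adf}) is $\gamma = 0$.
```

LaTeX assigns the equation number and fills it in wherever `\ref{eq:adf}` appears, so the numbering stays correct when equations are added or reordered.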
Once you understand how all the pieces fit together, you can modify these files to conduct your own reproducible project. Just make sure your `data-raw` folder is accessed by your analysis scripts and that your results are stored in the `analysis-output` folder. Then make sure your manuscript pulls the data and analysis results automatically.
Akerlof, George. 1970. “The Market for 'Lemons': Quality Uncertainty and the Market Mechanism.” Quarterly Journal of Economics 84.
Gandrud, Christopher. 2013. Reproducible Research with R and RStudio. CRC Press.
Said, Said E, and David A Dickey. 1984. “Testing for Unit Roots in Autoregressive-Moving Average Models of Unknown Order.” Biometrika 71 (3). Biometrika Trust: 599–607.