This template can be used to begin projects that are reproducible from raw data through published paper or thesis.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Research Project Template

This repository contains a template for a reproducible research project. The fundamental idea of reproducible research is that the steps that take your research from raw data to manuscript, thesis, or report should be fully automated. This way, your work can be checked by your adviser, mentors, collaborators, others working in your area, journal reviewers, and your future self.

I became interested in reproducible research because I was tired of being terrified of my own analysis. I was constantly petrified that someone would question my work and ask me to open the black box and verify that what I did was correct. In some cases even being asked to reproduce a result was terrifying, because I knew the convoluted path of data prep and cleaning that I took in arriving at my result.

I read the book Reproducible Research with R and RStudio by Christopher Gandrud (2013), along with a lot of blog posts and tutorials by Karl Broman and Carl Boettiger, and I struck out on my own path to execute a reproducible research project from start to finish. The repository for that project is here. While I was successful in learning the basics of how R, rmarkdown, knitr, and pandoc combine to make reproducible research possible, you can tell just by looking at the project's GitHub repository that I failed miserably at making the project reproducible. The repository is totally unorganized, and I am pretty sure I am the only one who could reproduce the results from this project.

But in that failure, I learned a lot about how a reproducible research project should be organized. I built this template for my future students and for my future self.

R and RStudio are an excellent vehicle for conducting reproducible research. You write manuscripts and reports in R Markdown (.Rmd) documents that include code chunks that perform the analysis. The code chunks are evaluated by R, and the tools in the knitr package incorporate their output into the document to produce a Markdown (.md) document. From there, a program called pandoc converts the Markdown document to whatever file format you like: PDF (formatted with LaTeX .cls class files), HTML, or Microsoft Word. This all happens without the user needing to know what is going on, which makes it easy to get started.

Getting Started

First, download the repository to your local machine. If you use GitHub, this means cloning the repository into a new R project. If you are not a GitHub user, simply click 'Download Zip' and extract the file to a convenient location.

Repository Contents

The repository contains several folders and files. They are organized to keep data preparation and cleaning in one folder (data-raw), the analysis in another (analysis), and the outputs of the analysis that will become tables, figures, and numbers in the text of the manuscript in a third (analysis-output). The remaining files in the root directory relate to the manuscript itself. Next, we'll demonstrate how to link the raw data to the analysis, the analysis to the output, and the output to the manuscript, so that all the steps that generate the manuscript are automated and thus not subject to the inconsistencies that go along with piecewise data preparation and analysis.
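If you want to recreate the three-folder layout in a fresh project, a minimal sketch in base R might look like the following. The project location (a temporary directory named my-project) is a hypothetical placeholder, not something this repository prescribes.

```r
# Hypothetical project location; replace with your own path.
project_root <- file.path(tempdir(), "my-project")

# The three folders described above.
folders <- c("data-raw", "analysis", "analysis-output")

# Create each folder (and the project root) if it does not exist yet.
for (f in folders) {
  dir.create(file.path(project_root, f), recursive = TRUE, showWarnings = FALSE)
}

# List what was created.
print(list.dirs(project_root, full.names = FALSE, recursive = FALSE))
```

Manuscript files (the .Rmd, the .bib database, and any LaTeX template) would then sit directly in the project root, as they do in this repository.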

Generate the Manuscript with One Click

Open the manuscript-example.Rmd and tablesandfigures-examples.Rmd files in RStudio. Before knitting, install the following packages if they are not already installed:

install.packages("xtable")
install.packages("ggplot2")
install.packages("ggfortify")
install.packages("gridExtra")
install.packages("Quandl")
install.packages("RCurl")
install.packages("xts")
install.packages("urca")
install.packages("vars")
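To avoid reinstalling packages you already have, one option is to check first and install only what is missing. This is a sketch; the helper name missing_packages is made up for illustration, and the install call is left commented out because it needs a network connection.

```r
# Packages used by the example manuscript and analysis scripts.
required <- c("xtable", "ggplot2", "ggfortify", "gridExtra",
              "Quandl", "RCurl", "xts", "urca", "vars")

# Hypothetical helper: return the subset of pkgs that is not installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
} else {
  message("All required packages are already installed.")
}
```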

Click the 'Knit PDF' button on the code editing pane, and voila! A PDF of the manuscript should appear. In what follows we will walk through what is happening step by step.
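The 'Knit' button can also be driven from the R console with rmarkdown::render(), which is handy once you want to script the whole build. The wrapper below is a sketch: the function name knit_manuscript is hypothetical, and it simply skips rendering when the input file or the rmarkdown package is not available.

```r
# Hypothetical wrapper around rmarkdown::render() for this project's manuscript.
knit_manuscript <- function(input = "manuscript-example.Rmd",
                            format = c("pdf_document", "word_document")) {
  format <- match.arg(format)

  # Skip quietly if the file or the rmarkdown package is missing.
  if (!file.exists(input) || !requireNamespace("rmarkdown", quietly = TRUE)) {
    message("Skipping: ", input, " or the rmarkdown package is not available.")
    return(invisible(NULL))
  }

  # Returns the path of the rendered file.
  rmarkdown::render(input, output_format = format)
}

# Example call; run it from the root directory of the repository.
knit_manuscript("manuscript-example.Rmd", format = "pdf_document")
```

Rendering to PDF still requires a working LaTeX installation, just as the button does.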

Putting it all Together

Now we see the contents of the data and analysis files and how they come together.

The data-raw Folder

The data-raw folder should contain either your raw data files (which will never, ever be modified) or a script that makes an API call, pulls the raw data in from a shared server, etc. In this example, there is a script called fetch-raw-data.R, and its contents are shown below. This file fetches corn and soybean price data from quandl.com and puts them in data objects called CZ2016 and SX2016. Because the data are requested as xts objects, the script can then trim them by date to the study period of interest.

# Filename: fetch-raw-data.R
# This file fetches the raw data and performs pre-processing (cleaning) to get it ready for analysis

library(RCurl)
library(xts)
library(Quandl)
Quandl.api_key("79SfoMaQc1npRAuq9ExZ")
# Define Dates of Analysis
  start  <- '2015-01-01'
  today  <- format(Sys.time(),"%Y-%m-%d")

# Fetch Corn and Soybean Prices
  CZ2016 <- Quandl("CME/CZ2016", type = "xts")
  SX2016 <- Quandl("CME/SX2016", type = "xts")

# Trim the dates
  CZ2016 <- CZ2016[paste0(start,'/',today), 'Settle']
  SX2016 <- SX2016[paste0(start,'/',today), 'Settle']

Of course, every data cleaning and preparation activity will be different, but in this file you should do all the preparation so that the objects created by this script are ready to be accepted in the analysis.R script.
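If you want to experiment with the trimming step without an API key or a network connection, here is a sketch in base R that uses simulated prices in place of the Quandl download. The data, the fixed end date, and the object names prices and trimmed are all made up for illustration; the real script uses xts date-range subsetting rather than a data frame.

```r
# Simulated daily settlement prices standing in for a downloaded series.
set.seed(42)
dates  <- seq(as.Date("2014-10-01"), as.Date("2015-03-31"), by = "day")
prices <- data.frame(Date   = dates,
                     Settle = 400 + cumsum(rnorm(length(dates))))

# Study period. The real script ends at today's date; a fixed end date
# is used here so the example gives the same result every time.
start <- as.Date("2015-01-01")
end   <- as.Date("2015-02-28")

# Keep only the rows inside the study period.
trimmed <- prices[prices$Date >= start & prices$Date <= end, c("Date", "Settle")]

print(range(trimmed$Date))
print(nrow(trimmed))
```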

The analysis Folder

The contents of the analysis folder are below. The key is the line that says, source('data-raw/fetch-raw-data.R'). This calls the fetch-raw-data.R script so that when you run the code below, the raw data are fetched and prepared (from scratch each time you run the script). Then, the following contents of the analysis.R script test the corn and soybean prices for the presence of unit roots via the ADF test (Said and Dickey 1984).

# Filename: analysis.R
# This file performs statistical analysis. It could be just one file, so it doesn't necessarily
# need its own folder, but sometimes your analysis may get complicated enough that you want
# to compartmentalize it. Separating different types of analyses into different scripts contained
# in the same folder can facilitate this.

library(urca)
library(vars)
# This line runs the source code that fetched your raw data and cleaned it. Now it is available 
# for conducting analysis.
source('data-raw/fetch-raw-data.R')

# Store results of ADF tests for Corn and Soybeans in a list
adf      <- list()
adf[[1]] <- ur.df(CZ2016, type = 'drift', lags = 5)
adf[[2]] <- ur.df(SX2016, type = 'drift', lags = 5) 


# Store results of a Johansen cointegration test for Corn and Soybeans 
jct      <- ca.jo(cbind(CZ2016, SX2016), type = 'eigen', K = 5)


# Fit a VAR

lag_selection <- VARselect(cbind(CZ2016, SX2016), lag.max = 8)

var_model <- VAR(cbind(CZ2016, SX2016), p = 1, type = "const")

# Save these results so that they can be pulled into the manuscript without re-running the analysis.
save(adf, jct, lag_selection, var_model, file = 'analysis-output/results.rda')
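To make the ADF step less of a black box, the sketch below builds the test regression with a drift term by hand in base R, using a simulated random walk instead of the price data. It is an illustration of the regression being estimated, not urca's implementation; the names y, k, reg, and adf_stat are chosen here for the example.

```r
# Simulated random walk: it has a unit root by construction.
set.seed(123)
y <- cumsum(rnorm(200))
k <- 5  # number of lagged differences, mirroring lags = 5 above

# First differences: dy[i] = y[i + 1] - y[i].
dy <- diff(y)

# Each row holds the current difference followed by its k lags.
dy_lags <- embed(dy, k + 1)
colnames(dy_lags) <- c("dy_t", paste0("dy_lag", 1:k))

# Add the lagged level y_{t-1}, aligned with dy_t.
reg <- data.frame(dy_lags, y_lag1 = y[(k + 1):(length(y) - 1)])

# Regression with an intercept (drift), the lagged level, and lagged differences.
fit <- lm(dy_t ~ ., data = reg)

# The ADF statistic is the t value on the lagged level.
adf_stat <- summary(fit)$coefficients["y_lag1", "t value"]
print(adf_stat)
```

Note that this statistic must be compared with Dickey-Fuller critical values rather than the usual t tables, which is one reason to rely on a package such as urca in practice.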

The analysis-output Folder

The last line of the code snippet above says, save(adf, jct, lag_selection, var_model, file = 'analysis-output/results.rda'). This saves the objects that contain the ADF, Johansen cointegration, and VAR results to an .rda file called results.rda in the analysis-output folder. This 'R Data' file can be read in by R, and the variable names adf, jct, lag_selection, and var_model are preserved when it is loaded later. We will load the results.rda file into the tablesandfigures-example.Rmd document to make tables and figures in the manuscript.
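The way save() and load() preserve object names can be seen in a small round trip. This sketch writes to a temporary file and uses stand-in objects (adf_demo and jct_demo are hypothetical) rather than real test results.

```r
# Temporary file standing in for analysis-output/results.rda.
results_file <- file.path(tempdir(), "results.rda")

# Hypothetical stand-ins for real analysis results.
adf_demo <- list(stat = -1.2)
jct_demo <- "placeholder"

# Save both objects, then remove them from the workspace.
save(adf_demo, jct_demo, file = results_file)
rm(adf_demo, jct_demo)

# load() restores the objects under their original names and
# returns those names as a character vector.
loaded <- load(results_file)
print(loaded)
print(adf_demo$stat)
```

Because the names come back unchanged, the manuscript can refer to the same object names that the analysis script created.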

The Manuscript

At the top of the file named manuscript-example.Rmd you see a YAML header (YAML stands for "YAML Ain't Markup Language", originally "Yet Another Markup Language"). This header tells knitr and pandoc exactly what you want done with the document.

---
title: "A Very Serious Analysis of the Stationarity of Corn and Soybean Prices"
author: "Peter Economist, Paul Economist, Mary Economist"
date: 'May 06, 2016'
output: 
  pdf_document:
    template: simple.latex
    fig_caption: yes
documentclass: ajae
bibliography: bibliography.bib
---

Title and author are self explanatory.

date: In the source file, this field tells knitr to insert the current date, formatted in the %B %d, %Y style. The header above shows the result for the day it was rendered.
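To see what the %B %d, %Y style produces, you can run format() directly. This sketch uses a fixed date so the result is stable, and it assumes the date field in the source file holds inline R code along the lines of format(Sys.time(), '%B %d, %Y').

```r
# A fixed date, so the output does not change from day to day.
fixed_day <- as.Date("2016-05-06")

# %B = full month name, %d = two-digit day, %Y = four-digit year.
formatted <- format(fixed_day, "%B %d, %Y")
print(formatted)

# The same format applied to the current date, as a manuscript would use it.
print(format(Sys.time(), "%B %d, %Y"))
```

The month name follows your system locale, so it appears in English only on an English-language setup.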

output: After knitr evaluates the code chunks contained in the body of the file, the output field tells pandoc what kind of file to create. Here we have specified PDF output. Pandoc produces a PDF by first creating a .tex file; if no further fields are specified, pandoc uses a default LaTeX template (based on the article class) to make the document. Here we have specified that the manuscript be formatted according to the specifications of the American Journal of Agricultural Economics (AJAE). Since the journal has its own LaTeX class (ajae) that comes in the standard LaTeX distribution, we can just specify documentclass: ajae and the formatting is handled. We also needed to specify template: simple.latex because something in the default pandoc template was clashing with the ajae.cls file. I removed the problem lines and saved the result as simple.latex, which you can see in the root directory of this repository. We will cover how to specify different output formats in a later section.

bibliography: The file bibliography.bib is located in the root directory of this repository. It is a BibTeX database of all the references needed for the manuscript. Open this file and note what the reference entries look like. To build a database for your own paper, Google Scholar has a 'cite' button below every search result it returns. Click 'cite', then click 'BibTeX', and a plain text window will open with the properly formatted BibTeX entry. Just copy and paste this into bibliography.bib.

Markdown Basics

Formatting a document with Markdown is very easy, and there are many resources for learning the basics. Start with http://rmarkdown.rstudio.com/index.html and explore.

Main points (each item shows the Markdown source first, then the rendered result):

# This is a Level 1 Header

This is a Level 1 Header

## This is a Level 2 Header

This is a Level 2 Header

This is a citation of Akerlof's Lemons paper [@akerlof1970vthe].

This is a citation of Akerlof's Lemons paper (Akerlof 1970).

Code Chunks

This is an example of a code chunk from the manuscript document. The opening line tells knitr that what follows is a code chunk to be evaluated, and sets the options for that chunk.

```{r, echo = FALSE, warning = FALSE, message = FALSE, results = "asis"}
t <- list()
t[[1]] <- xtable(adf[[1]]@testreg, caption = "ADF Results for Corn")
t[[2]] <- xtable(adf[[2]]@testreg, caption = "ADF Results for Soybeans")
print.xtable(t[[1]], caption.placement = 'top', comment = FALSE)
```

In the opening code chunk, we specify that we want to load the results from the analysis-output folder and we also want to fetch the raw data, which we will plot in a later code chunk. Also, we load all the libraries that will be used by later code chunks.

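```{r, warning = FALSE, message = FALSE, echo = FALSE}
library(xtable)
library(ggplot2)
library(ggfortify)
library(gridExtra)
# Fetch the raw data, which is plotted in a later chunk
source('data-raw/fetch-raw-data.R')
# File name must match the one written by analysis.R (file names are case-sensitive on some systems)
load('analysis-output/results.rda')
```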

A Note About Latex Styles and the YAML Header

Getting the pandoc default LaTeX template and the LaTeX style you prefer to work together can be a little tricky. In the root directory of this repository there is a file called style-headers.md. It contains a few complete YAML headers that should work if you copy and paste them to replace the current manuscript's YAML header.

Collaborating with Microsoft Word Users

Many of us have colleagues who expect to receive, and be free to edit, Microsoft Word documents. Fortunately, reproducibility can be maintained. With the manuscript-example.Rmd file open, notice that the 'Knit PDF' button is actually a drop-down menu and 'Knit Word' is an option. If you click it, it will produce a Microsoft Word document that you can deliver to your colleague or professor.

The Word document can be formatted with a .docx template; see the 'Style Reference' description on this page. Using the template will keep you from having to reformat the whole document every time you send an update to your colleagues and professors.

You might have noticed that there is also a tablesandfigures.Rmd file in the root directory. This is for users who need to produce Word documents. I have found no clean way to produce decently formatted tables and figures in Word using this method. I recommend keeping tables and figures in a separate document that you always render as a PDF, and keeping the manuscript text in its own file.

Equations are still a problem. Pandoc can interpret math symbols surrounded by $; for example, typing $e^{i \pi} = -1$ produces the typeset form of Euler's identity. However, these equations are not automatically numbered. To get automatically numbered equations that can be cross-referenced, they must be written in pure LaTeX, as in

\begin{equation} \Delta y_t = \alpha + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \dots + \delta_{p-1} \Delta y_{t-p+1} + \epsilon_t \end{equation}

which is rendered as a numbered equation:

$$\Delta y_t = \alpha + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \dots + \delta_{p-1} \Delta y_{t-p+1} + \epsilon_t \qquad (1)$$

The trouble is that raw LaTeX is ignored by pandoc when producing Word documents, so when you knit the Word document after writing your equations in pure LaTeX, they will be missing from the Word document. This means you will have to replace them in the Word document one way or another. There is a reasonable workaround. IguanaTex is a Microsoft PowerPoint add-in that takes LaTeX equations and returns copy-and-pasteable figures of typeset equations. I recommend creating one new slide for each equation in your document, then using IguanaTex to obtain figures of your equations that can be pasted into the Word document.

Starting Your Own Reproducible Project

Once you understand how all the pieces fit together, you can modify these files to conduct your own reproducible project. Just make sure the contents of your data-raw folder are accessed by your analysis scripts and that your results are stored in the analysis-output folder. Then make sure your manuscript pulls in the data and analysis results automatically.

References

Akerlof, George. 1970. “The Market for ‘Lemons’: Quality Uncertainty and the Market Mechanism.” Quarterly Journal of Economics 84 (3): 488–500.

Gandrud, Christopher. 2013. Reproducible Research with R and RStudio. CRC Press.

Said, Said E, and David A Dickey. 1984. “Testing for Unit Roots in Autoregressive-Moving Average Models of Unknown Order.” Biometrika 71 (3). Biometrika Trust: 599–607.