This guide provides instructions for using R on research projects. Its purpose is to use with collaborators and research assistants to make code consistent, easier to read, transparent, and reproducible.
For coding style practices, follow the tidyverse style guide.
- While you should read the style guide and do your best to follow it, once you save the script you can use
styler::style_file()
to fix its formatting and ensure it adheres to the tidyverse style guide.- Note:
styler::style_file()
overwrites files (if styling results in a change of the code to be formatted). The documentation strongly suggests to only style files that are under version control or to create a backup copy.
- Note:
- Use
tidyverse
and/ordata.table
for wrangling data.- For big data (millions of observations), the efficiency advantages of
data.table
become important. - The efficiency advantages of
data.table
can be important even with smaller data sets for tasks likerbind
ing, reshaping (h/t Grant McDermott's benchmarks), etc.
- For big data (millions of observations), the efficiency advantages of
- Use
stringr
for manipulating strings. - Use
lubridate
for working with dates. - Use
conflicted
to explicitly resolve namespace conflicts.conflicted::conflict_scout()
displays namespace conflictsconflicted::conflict_prefer()
declares the package to use in namespace conflicts, and theconflict_prefer()
calls should be a block of code towards the top of the script, underneath the block oflibrary()
calls.
- Never use
setwd()
or absolute file paths. Instead, use relative file paths with thehere
package.- To avoid conflicts with the deprecated
lubridate::here()
, if using both packages in a script, specifyconflict_prefer("here", "here")
.
- To avoid conflicts with the deprecated
- Use
assertthat::assert_that()
frequently to add programmatic sanity checks in the code. - Use pipes like
%>%
frommagrittr
. See here for more on using pipes. Other useful pipes are the compound assignment pipe%<>%
(which, unlike Hadley, I like to use) and the%$%
exposition pipe. - Use my package
tabulator
for some common data wrangling tasks.tabulator::tab()
efficiently tabulates based on a categorical variable, sorts from most common to least common, and displays the proportion of observations with each value, as well as the cumulative proportion.tabulator::tabcount()
counts the unique number of categories of a categorical variable or formed by a combination of categorical variables.tabulator::quantiles()
produces quantiles of a variable. It is a wrapper for base Rquantile()
but is easier to use, especially withindata.table
s ortibble
s.
- Use
fixest
for fixed effects regressions; it is much faster thanlfe
(and also appears to be faster than the best current Julia or Python implementations of fixed effects regression). - Use
modelsummary
for formatting tables. Hmisc::describe()
andskimr::skim()
can be useful to print a "codebook" of the data, i.e. some summary stats about each variable in a data set. Since they do not provide identical information, it might be best to run both.- This can be used in conjunction with
sink()
to print the codebook to a text file. For example:
library(tidyverse) library(Hmisc) library(skimr) library(here) # Write codebook to text file sink(here("results", "mtcars_codebook.txt")) mtcars %>% describe() %>% print() # print() needed if running script from command line mtcars %>% skim() %>% print() sink() # close the sink
- This can be used in conjunction with
Generally, within the folder where we are doing data analysis, we have:
- An .Rproj file for the project. (This can be created in RStudio, with File > New Project.)
- Note that if you always open the Project within RStudio before working (see "Project" in the upper right-hand corner of RStudio) then the
here
package will work for relative filepaths.
- Note that if you always open the Project within RStudio before working (see "Project" in the upper right-hand corner of RStudio) then the
- data - only raw data go in this folder
- documentation - documentation about the data goes in this folder
- proc - processed data sets go in this folder
- results - results go in this folder
- figures - subfolder for figures
- tables - subfolder for tables
- scripts - code goes in this folder
- programs - a subfolder containing functions called by the analysis scripts (if applicable)
Because we often work with large data sets and efficiency is important, I advocate (nearly) always separating the following three actions into different scripts:
- Data preparation (cleaning and wrangling)
- Analysis (e.g. regressions)
- Production of figures and tables
The analysis and figure/table scripts should not change the data sets at all (no pivoting from wide to long or adding new variables); all changes to the data should be made in the data cleaning scripts. The figure/table scripts should not run the regressions or perform other analysis; that should be done in the analysis scripts. This way, if you need to add a robustness check, you don't necessarily have to rerun all the data cleaning code (unless the robustness check requires defining a new variable). If you need to make a formatting change to a figure, you don't have to rerun all the analysis code (which can take awhile to run on large data sets).
- Include a 00_run.R script (described below).
- Number scripts in the order in which they should be run, starting with 01.
- Because a project often uses multiple data sources, I usually include a brief description of the data source being used as the first part of the script name (in the example below,
ex
describes the data source), followed by a description of the action being done (e.g.dataprep
,reg
, etc.), with each component of the script name separated by an underscore (_
).
Keep a "run" script, 00_run.R that lists each script in the order they should be run to go from raw data to final results. Under the name of each script should be a brief description of the purpose of the script, as well all the input data sets and output data sets that it uses. Ideally, a user could run 00_run.R
to run the entire analysis from raw data to final results (although this may be infeasible for some projects, e.g. one with multiple confidential data sets that can only be accessed on separate servers).
- Also include objects that can be set to 0 or 1 to only run some of the scripts from the 00_run.R script (see the example below).
Below is a brief example of a 00_run.R script. (Note that you might replace scripts 03 and 04 with .Rmd files depending on how you want to present the results.)
# Run script for example project
# PACKAGES ------------------------------------------------------------------
library(here)
# PRELIMINARIES -------------------------------------------------------------
# Control which scripts run
run_01_ex_dataprep <- 1
run_02_ex_reg <- 1
run_03_ex_table <- 1
run_04_ex_graph <- 1
# RUN SCRIPTS ---------------------------------------------------------------
# Read and clean example data
if (run_01_ex_dataprep) source(here("scripts", "01_ex_dataprep.R"), encoding = "UTF-8")
# INPUTS
# here("data", "example.csv") # raw data from XYZ source
# OUTPUTS
# here("proc", "example.rds") # cleaned
# Regress Y on X in example data
if (run_02_ex_reg) source(here("scripts", "02_ex_reg.R"), encoding = "UTF-8")
# INPUTS
# here("proc", "example.rds") # 01_ex_dataprep.R
# OUTPUTS
# here("proc", "ex_fixest.rds") # fixest object from feols regression
# Create table of regression results
if (run_03_ex_table) source(here("scripts", "03_ex_table.R"), encoding = "UTF-8")
# INPUTS
# here("proc", "ex_fixest.rds") # 02_ex_reg.R
# OUTPUTS
# here("results", "tables", "ex_fixest_table.tex") # tex of table for paper
# Create scatterplot of Y and X with local polynomial fit
if (run_04_ex_graph) source(here("scripts", "04_ex_graph.R"), encoding = "UTF-8")
# INPUTS
# here("proc", "example.rds") # 01_ex_dataprep.R
# OUTPUTS
# here("results", "figures", "ex_scatter.eps") # figure
- Use
ggplot2
, and for graphs with color consider colorblind-friendly palettes such asscale_color_viridis_*()
orggthemes::scale_color_colorblind()
. - I wrote a function
set_theme.R
to standardize and facilitate graph formatting. It can be added to aggplot
object like any other theme would be, e.g.:but it differs from other themes in that you can directly change its default formatting withinlibrary(tidyverse) # Use the defaults mtcars %>% ggplot() + geom_point(aes(y = hp, x = wt)) + labs(y = "Horsepower", x = "Weight") + set_theme()
set_theme()
. For example:See# Change margins mtcars %>% ggplot() + geom_point(aes(y = hp, x = wt)) + labs(y = "Horsepower", x = "Weight") + set_theme( y_title_margin = "r = 5", x_title_margin = "t = 5", plot_margin = unit(c(t = 2, r = 2, b = 2, l = 2), "pt") )
set_theme_reprex.R
for more examples of its use with changes to its defaults, and look at the function itself to see what the arguments and graph formatting settings that it can change are. (Pull requests welcome to expand it to more use cases.) - For reproducible graphs (independent of the size of your Plots pane in RStudio), always specify the
width
andheight
arguments inggsave()
. - To see what the final graph looks like, open the file that you save since its appearance will differ from what you see in the RStudio Plots pane when you specify the
width
andheight
arguments inggsave()
. - For high resolution, save graphs as .eps or .pdf files.
- I've written a Python function
crop_eps
to crop (post-process) .eps files when you can't get the cropping just right withggplot2
. crop_pdf
coming soon
- I've written a Python function
- For maps, use the
sf
package. This package makes plotting maps easy (withggplot2::geom_sf()
), and also makes other tasks like joining geocoordinate polygons and points a breeze.
-
For small data sets, save as .csv with
readr::write_csv()
and read withreadr::read_csv()
. (Note: thereadr
package is part oftidyverse
.)- When reading in a large .csv file from another source, it can be worth using
data.table::fread()
for speed improvements. - For smaller .csv files I prefer
readr::read_csv()
due to a number of nice features such as the way it handles columns with dates.
- When reading in a large .csv file from another source, it can be worth using
-
For medium-sized data sets, save as .rds with
saveRDS()
orreadr::write_rds()
, and read withreadRDS()
orreadr::read_rds()
.readr::write_rds()
is a wrapper forsaveRDS()
that specifiescompress = FALSE
by default. The trade-off is that compressing (the default insaveRDS()
) will make the file substantially smaller so it takes up less disk space, but it will take longer to read and write.
-
For large data sets when read and write performance becomes important, the
qs
package performs quick serialization of R objects. Our benchmarks suggest thatqs::qsave()
compresses files slightly more thansaveRDS()
and is faster thanreadr::write_rds()
, so it provides the best of both worlds.- We still recommend
saveRDS()
orreadr::write_rds()
for medium-sized data sets for which read and write speed is not an issue, since these functions have a longer history of use and support.
- We still recommend
-
When doing a time-consuming
map*()
or loop, e.g. reading in and manipulating separate data sets for each month, it is a good idea to save intermediate objects as part of the function being called bymap*()
or as part of the loop. That way, if something goes wrong you won't lose all your progress.
When randomizing assignment in a randomized control trial (RCT):
- Seed: Use a seed from https://www.random.org/: put Min 1 and Max 100000000, then click Generate, and copy the result into your script. Towards the top of the script, assign the seed with the line
where
seed <- ... # from random.org
...
is replaced with the number that you got from random.org - Use the
randomizr
package. Here is a cheatsheet of the different randomization functions. - Immediately before the line using a randomization function, include
set.seed(seed)
. - Build a randomization check: create a second variable a second time with a new name, repeating
set.seed(seed)
immediately before creating the second variable. Then check that the randomization is identical usingassert_that(all(df$var1 == df$var2))
. - As a second randomization check, create a separate script that runs the randomization script once (using
source()
) but then saves the data set with a different name, then runs it again (withsource()
), then reads in the two differently-named data sets from these two runs of the randomization script and ensures that they are identical. - Note: if creating two cross-randomized variables, you would not want to repeat
set.seed(seed)
before creating the second one, otherwise it would use the same assignment as the first.
Above I described how data preparation scripts should be separate from analysis scripts. Randomization scripts should also be separate from data preparation scripts, i.e. any data preparation needed as an input to the randomization should be done in one script and the randomization script itself should read in the input data, create a variable with random assignments, and save a data set with the random assignments.
Once you complete a script, which you might be running line by line while you work on it, make sure the script works on a fresh R session. To do this from RStudio:
- Ctrl+Shift+F10 to restart the R session running behind the scenes in RStudio.
- Ctrl+Shift+Enter to run the whole script
To avoid inefficiently saving and restoring the workspace when closing and opening RStudio, go to Tools > Global Options... > General and:
- Uncheck "Restore .RData into workspace at startup"
- For "Save Workspace to .RData on exit", select "Never"
- Click OK
Similarly, when running R scripts from the command line, specify the --vanilla
option to avoid ineffecient saving/restoring of the workspace.
When calling source()
within one script (or the RStudio Console) to run another script, always specify the argument encoding = "UTF-8"
. This ensures that code with special characters (e.g., letters with accent marks) will run correctly.
Use renv
(instead of packrat
) to manage the packages used in an RStudio project, avoiding conflicts related to package versioning.
renv::init()
will develop a "local library" of the packages employed in a project. It will create the following files and folders in the project directory:renv.lock
,.Rprofile
, andrenv/
. Binaries of the project's packages will be stored in therenv/library/
subfolder.- When working on the project, use
renv::snapshot()
to update yourrenv
-related files. Make sure these are updated when pushing project changes to GitHub, sharing files with others, or preparing the replication package. - When deploying the project on a different machine, make sure that the
renv.lock
,.Rprofile
, andrenv/
files/folders are present. Therenv/library/
subfolder, containing system-specific package binaries, should be excluded. Once these requirements are met, you can launch the project and runrenv::restore()
to restore all packages in the new machine, in their appropriate versions.
Github is an important tool to maintain version control and for reproducibility purposes. There are many tutorials online, like Grant Mcdermott's slides here, and I will share some tips from these notes. I will provide instructions for only the most basic commands here.
We need to first create a git repository or clone an existing one.
- To clone an existing github repository, use
git pull repolink
where repolink is the link copied from the repository on Github. - To initialize a new repo, use
git init
in the project directory
Once you you have initialized a git repository and you make some local changes, you will want to update the Github repository, so that other collaborators can have access to updated files. To do so, we will use the following process:
- Check the status:
git status
. I like to use this frequently, in order to see file you've changed and those you still need to add or commit. - Add files for staging:
git add <filename>
. Use this to add local changes to the staging area. - Commit:
git commit
. This command saves your changes to the local repository. It will prompt you to add a commit message, which can be more concisley written asgit commit -m "Helpful message"
- Push changes to github: assuming there is a Github repository created, use
git push origin master
to push your saved local changes to the Github repo.
However, there are often times when we encounter merge conflicts. A merge conflict is an event that occurs when Git is unable to automatically resolve differences in code between two commits. For example, if a collaborator edits a piece of code, stages, commits and pushes a change, and you try to push changes for the same piece of code, you will encounter a merge conflict. We need to figure out how fix these conflicts.
- I like to start with
git status
which shows the conflicted files. - If you open up the conflicted files with any text editor, you will see a couple of things.
<<<<<<< HEAD
shows the start of the merge conflict.=======
shows the break point used for comparison.>>>>>>> <long string>
shows the end of merge conflict.
- You can now manually edit the code and delete whatever lines of code you don't want and the special chanracters that Git added in the file. After that you can stage, commit and push your files without conflict.
Sometimes RStudio projects don't play nicely with Dropbox syncing because Dropbox is trying to continuously sync the .Rproj.user file while you are editing code. This leads to a frequent pop-up error message "The process cannot access the file because it's being used by another process." To solve this issue on Windows:
- Open your .Rproj project in RStudio
- Run the function
dropbox_project_sync_off()
(details)
Here I will present the best methods for linking a project on both Dropbox and Github, which is inspired, but modified from this tutorial). The RA (or whomever is setting up the project) should complete ALL of the following steps. Others need to do only the steps marked with (All). Before going ahead, make sure you have both a Github account, a Dropbox account, and the Dropbox app downloaded on your computer. The main idea of this setup is that our Dropbox will serve as an extra clone where we can share new raw data, but the main version control will be done on Github.
-
First, establish the Dropbox folder for the project. Create a Dropbox folder, share it with all project members, and let's call the project we are working on "SampleProject". In this step, we aren't doing anything with Github.
-
The RA will now create a github repo for the project, name it identically to the Dropbox folder and clone it locally to your computer.
-
(All) Clone the github repo locally by going to terminal. I will clone this in my home directory. To do so, I would type
cd noahforougi
git clone repolink
where repolink is the link copied from the repository page on Github.com. It is important that when you clone this repository, you are doing it in a directory that is not associated with Dropbox.
- The next step is to clone the repository again, but this time in the local Dropbox directory. So, for example, say I have cloned the project in this directory
/Users/noahforougi/SampleProject/
. I will now change the directory to my Dropbox directory and clone the Github repo to the Dropbox.
cd /Dropbox/SampleProject/
git clone repolink
- Now, we want to create a more formal project structure. To do so, we are going to edit the Dropbox directly (we will only be doing this once!). Follow the conventions mentioned earlier in this guide, create the project on Dropbox, but exclude the proc folder. The dropbox should look something like this:
- Dropbox/
- SampleProject/
- data/
- documentation/
- README.md
- results/
- scripts/
- SampleProject/
- (All) We want the Dropbox project structure (which the RA has created) on our local repo which is synched with Github(in my case, the /Users/noahforougi/ directory). We will only have to do this once, but we are going to manually copy and paste all the folders into our local repo. Additionally, create a proc/ folder locally. This allows us to share raw data via Dropbox, but the processed data will be generated by actually running the scripts locally. Our project should look like this:
- noahforougi/
- SampleProject/
- data/
- proc/
- documentation/
- README.md
- results/
- scripts/
- SampleProject/
- (All) We want to create a .gitignore file in the local directory. This means when we push our local changes to Github, we are ignoring the data/, proc/ and documentation/ folders. This is crucial because of data confidentiality reasons. There are plenty of tutorials online about how to create a .gitignore file. In the gitignore file please include the following:
/documentation/*
/data/*
/proc/*
- Now, go back into the Dropbox folder and repeat this step. We need to create a .gitignore file in the Dropbox as well.
Our project structure is complete. We can now make local edits to the scripts and results and push them to Github. All other project members will be able to receive these changes and update their local proc/ files by running the newly synched scripts. The main interactions should be to push local edits to Github. You should not be making edits to the scripts located on the Dropbox. If we want to share new raw data, we will need to copy and paste that locally, but it will not cause issues because of the .gitignore file.
Some additional tips:
- Error handling: use
purrr::possibly()
andpurrr::safely()
rather than base RtryCatch()
- Progress bars: for intensive
purrr::map*()
tasks you can easily add progress bars withdplyr::progress_estimated()
(instructions) - If you need to log some printed output, a quick and dirty way is
sink()
.