- Jennifer Bryan: @jennybc
- Hilmar Lapp: @hlapp
- Ciera Martinez @iamciera
- Kristina Riemer kristina.riemer@weecology.org
- Courtney Soderberg courtney@cos.io
- Naupaka Zimmerman naupaka@gmail.com
Students will learn the benefits of project and folder organization,
and how these enable reproducibility and reusability. They will then
complete an activity highlighting the structure of data files,
emphasizing the importance of documenting any changes made. Finally,
they will bring these two activities together in the context of a
reproducible project workflow centered around using knitr
in
RStudio.
At the begining of the session, students should be able
- to use a spreadsheet program to generate a plot
- to use a text editor (Word, Google Docs, etc.) to communicate
- be familiar with Rstudio: Rstudio layout, running R commands, knitr, and basic ggplot syntax (from Intro section)
At the end of the session students will be able to
- Evaluate folder and file structure of a project.
- Recognize common problems that occur in file organization.
- Be able to identify what plain text is.
- Demonstrate the benefits of using plain text.
- Distinguish between input and output files.
- Integrate file naming standards to projects.
- Distinguish between a spreadsheet formatted properly for later analysis in R and one formatted improperly
- Be able to recognize common data entry errors and how to handle them
- Be able to describe the concept of 'raw data' and why it is important
- Differentiate between manual and programmatic file manipulation and know the pros and cons of each
Lesson: 01-file-organization
This section starts with an activity to get the students thinking about "excavating" a folder in the future. It is meant to get the students thinking about what file names, file organization, and file content and what these can tell us about a project.
Slides can be found here, in Keynote format, in PDF, and as individual PNGs (scroll down through the README):
Lesson: 02-programatic-modification
In this section, the students will explore why it is beneficial to do programmatic modification by exploring what it takes to clean up a data file in Excel.
Note: could overlap in part with Intro, Activity 2; may require on-the-fly adjustments in response to that.
Students "knit" and modify. Using countryPick4.Rmd as a template, students learn how to import data, filter to one country, make a plot, write it to file, and comment data choices. Then the activity will illustrate what happens when you knit:
- Preview/Knit HTML, note what sorts of outputs are left behind.
- Discuss input and output files.
- Which files can we delete and reproduce? Which files are inputs, outputs, converters of inputs to outputs?
This section is meant for students to explore the power of writing reports in R.
- EP White, E Baldridge, ZT Brym, KJ Locey, DJ McGlinn, SR Supp (2013) "Nine simple ways to make it easier to (re)use your data." Ideas in Ecology and Evolution 6(2): 1–10, 2013. doi:10.4033/iee.2013.6b.6.f (in particular the section "Use standard table formats")
- WS Noble (2009) "A Quick Guide to Organizing Computational Biology Projects." PLoS Computational Biology 5 (7): e1000424. doi:10.1371/journal.pcbi.1000424
- File Naming Conventions & Best Practices (UBC Library, Research Data Management)
- File Format Considerations (UBC Library, Research Data Management)
- File naming guides and suggestions from
- Wikipedia entry on list of filename extensions
- Wikipedia entry on ISO 8601 standard for dates
- Good practice guidance on releasing statistics in spreadsheets (UK Government)
-
Processed and subset (population size, life expectancy, GDP per capita; only every 5 years only starting 1952, only complete records) Gapminder data as R package. The data-raw sub-directory reveals the journey from Gapminder.org's Excel workbooks to increasingly clean and tidy data.
-
clean dataset can be located in R in the following way (after installing the package):
pathToTsv <- system.file("gapminder.tsv", package = "gapminder")
-
-
All other lesson material is dedicated to the public domain under the CC Zero waiver.