
A reproducible resource for teaching geographic data science in R


granolarr

NOTE: I have deactivated the website for this repository due to ongoing issues with the Nokogiri library used by its website component.

A new version of these materials is currently under development as R for Geographic Data Science.

granolarr is a geogGRaphic dAta scieNce, reprOducibLe teAching resouRce in R

by Stefano De Sabbata

The materials included in granolarr (see the granolarr GitHub Pages) have been designed for a module focusing on the programming language R as an effective tool for data science. R is one of the most widely used programming languages, and it provides access to a vast repository of programming libraries, which cover all aspects of data science from data wrangling to statistical analysis, from machine learning to data visualisation. That includes a variety of libraries for processing spatial data, performing geographic information analysis, and creating maps. As such, R is an extremely versatile, free and open-source tool in geographic information science, which combines the capabilities of traditional GIS software with the advantages of a scripting language and an interface to a vast array of algorithms.

The materials aim to cover the necessary skills in basic programming, data wrangling and reproducible research to tackle sophisticated but non-spatial data analyses. The first part of the module will focus on core programming techniques, data wrangling and practices for reproducible research. The second part of the module will focus on non-spatial data analysis approaches, including statistical analysis and machine learning.

The lecture slides use #EAE2DF as the background colour to avoid the use of a pure white background, which can make reading more difficult and slower for people with dyslexia. For the same reason, all foreground colours have also been checked for readability using Colour Contrast Analyser. The practical session materials can be accessed online in their bookdown version, where the Sepia and Night themes are available, and they can be downloaded in pdf or epub format from the top menu. The practical session materials can also be downloaded separately in pdf format for printing.

Note: This is a revised version of granolarr, currently under development to meet the University of Leicester "Ignite" approach to blended learning for the academic year 2020/2021. The first version of granolarr is still available at granolarr_v1.


Materials

All the materials are available through the lectures bookdown and practical sessions bookdown pages. Links to the lecture slides and bookdown chapters for each week are listed below.

  1. R coding
    • 100 Introduction
      • 101 Lecture (slides, bookdown)
        • Introduction to R
      • 102 Lecture (slides, bookdown)
        • Core concepts
      • 103 Lecture (slides, bookdown)
        • Tidyverse
      • 104 Practical session (bookdown)
        • The R programming language
        • Interpreting values
        • Variables
        • Basic types
        • Tidyverse
        • Coding style
    • 110 R programming
      • 111 Lecture (slides, bookdown)
        • Data types (vectors, factors, matrices, arrays, lists)
      • 112 Lecture (slides, bookdown)
        • Control structures (conditional statements, loops)
      • 113 Lecture (slides, bookdown)
        • Functions
      • 114 Practical session (bookdown)
        • Vectors
        • Lists
        • Conditional statements
        • Loops
        • Functions
        • Scope of a variable
  2. Data wrangling
  3. Data analysis
    • 300 Exploratory data analysis
      • 301 Lecture (slides, bookdown)
        • Data visualisation
      • 302 Lecture (slides, bookdown)
        • Descriptive statistics
      • 303 Lecture (slides, bookdown)
        • Exploring assumptions
      • 304 Practical session (bookdown)
        • Data visualisation
        • Descriptive statistics
        • Exploring assumptions
    • 310 Comparing data
      • 311 Lecture (slides, bookdown)
        • Comparing groups
      • 312 Lecture (slides, bookdown)
        • Correlation
      • 313 Lecture (slides, bookdown)
        • Data transformations
      • 314 Practical session (to do)
        • Comparing means
        • Correlation
        • Chi-square
    • 320 Regression models
      • 321 Lecture (slides, bookdown)
        • Simple regression
      • 322 Lecture (slides, bookdown)
        • Assessing regression assumptions
      • 323 Lecture (to do) (slides, bookdown)
        • Multiple regression
      • 324 Practical session (bookdown)
        • Simple regression
        • Testing assumptions
        • Multiple regression
  4. Machine learning
    • 400 Supervised
      • 401 Lecture (slides, bookdown)
        • Introduction to Machine Learning
      • 412 Lecture (to do) (slides, bookdown)
        • Artificial Neural Networks
      • 413 Lecture (to do) (slides, bookdown)
        • Support vector machines
      • 414 Practical session (to do) (bookdown)
        • Support vector machines
    • 410 Unsupervised
      • 411 Lecture (to do) (slides, bookdown)
        • Principal Component Analysis
      • 402 Lecture (slides, bookdown)
        • Centroid-based clustering
      • 403 Lecture (slides, bookdown)
        • Hierarchical and density-based clustering
      • 404 Practical session (to do) (bookdown)
        • Geodemographic classification

Reference books

Suggested reading

  • R for Data Science by Garrett Grolemund and Hadley Wickham, O'Reilly Media, 2016. See online book.
  • Machine Learning with R: Expert techniques for predictive modeling by Brett Lantz, Packt Publishing, 2019. See book webpage.

Further reading

  • Programming Skills for Data Science: Start Writing Code to Wrangle, Analyze, and Visualize Data with R by Michael Freeman and Joel Ross, Addison-Wesley, 2019. See book webpage and repository.
  • The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff, No Starch Press, 2011. See book webpage.
  • Discovering Statistics Using R by Andy Field, Jeremy Miles and Zoë Field, SAGE Publications Ltd, 2012. See book webpage.
  • An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Springer, 2013. See book webpage.
  • Introduction to Machine Learning with R by Scott V. Burger, O'Reilly Media, 2018. See book webpage.
  • Machine Learning with R, the tidyverse, and mlr by Hefin I. Rhys, Manning Publications, 2020. See book webpage.
  • Deep Learning with R by François Chollet with J. J. Allaire, Manning Publications, 2018. See book webpage.
  • An Introduction to R for Spatial Analysis and Mapping by Chris Brunsdon and Lex Comber, Sage, 2015. See book webpage.
  • Geocomputation with R by Robin Lovelace, Jakub Nowosad and Jannes Muenchow, CRC Press, 2019. See online book.

Reproducibility

Instructor

You can now reproduce granolarr using Docker. First install Docker on your system, install Git if not already installed, and clone this repository from GitHub. You can then either build the sdesabbata/granolarr image by running the Docker_Build.sh script in the root directory of the repository or simply pull the latest sdesabbata/granolarr image from the Docker Hub.

You should now have all the code and the computational environment to reproduce these materials, which can be done by running the script Docker_Make.sh (Docker_Make_WinPowerShell.sh on Windows using PowerShell) from the repository folder. The script will instantiate a Docker container for the sdesabbata/granolarr image, bind mount the repository folder to the container and execute Make.R on the container, clearing and re-making all the materials. The data used in the materials can be re-created from the original open data using the scripts in src/utils, as described in data/README.md.

For instance, on a Unix-based system like Linux or macOS, you can reproduce granolarr using the following four commands:

docker pull sdesabbata/granolarr:latest
git clone https://github.com/sdesabbata/granolarr.git
cd granolarr
./Docker_Make.sh

This approach should allow you not only to use the materials as they are, but also to easily edit them and create your own version in the same computational environment. To develop your own materials, simply modify the code in the repository and run Docker_Make.sh from the repository folder again to obtain the updated materials.
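To make the description of the make step above more concrete, the following is a minimal sketch of the kind of command a script like Docker_Make.sh might wrap. It is not the content of the actual script. The function name granolarr_make, the mount point /home/rstudio/granolarr inside the container, the use of Rscript, and the location of Make.R in the repository root are all assumptions made for illustration. By default the function only prints the command; setting GRANOLARR_RUN=1 executes it.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the repository's Docker_Make.sh.
# Assumptions: the repository is mounted at /home/rstudio/granolarr,
# Make.R sits in the repository root, and Rscript is available in the image.

granolarr_make() {
  repo_dir="${1:-$(pwd)}"                  # repository folder on the host
  image="sdesabbata/granolarr:latest"      # image name given in this README
  mount_point="/home/rstudio/granolarr"    # assumed path inside the container

  # --rm removes the container when it exits,
  # -v bind mounts the repository folder into the container,
  # -w sets the working directory so that Make.R can be found.
  set -- docker run --rm \
    -v "${repo_dir}:${mount_point}" \
    -w "${mount_point}" \
    "${image}" Rscript Make.R

  if [ "${GRANOLARR_RUN:-0}" = "1" ]; then
    "$@"          # actually start the container
  else
    echo "$*"     # dry run: print the command only
  fi
}
```

Calling granolarr_make from the repository folder prints the command for inspection, while GRANOLARR_RUN=1 granolarr_make would run it. Because the repository folder is bind mounted, anything compiled inside the container would appear in the same folder on the host.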

The RMarkdown code used to create the materials for the lectures and practical sessions can be found in the src/lectures and src/practicals folders, respectively. Both folders contain one RMarkdown file per session which contains the headings necessary to create the respective html slides (compiled to docs/lectures/html) and pdf documents (compiled to docs/practicals/pdf), whereas the main corpus of the materials can be found in the files included in the respective contents folders. The latter files are also used directly to generate the Bookdown version of the materials (which are compiled to docs/lectures/bookdown and docs/practicals/bookdown). The docs folder also contains the files used to generate the GitHub Pages website using the Minimal Mistakes Jekyll theme. The utils folder also contains the IOSlides templates and some style classes used in the RMarkdown code.

.
├── DockerConfig
├── data
├── docs
│   ├── _data
│   ├── _pages
│   ├── _posts
│   ├── assets
│   │   └── images
│   ├── exercises
│   ├── lectures
│   │   ├── bookdown
│   │   └── html
│   └── practicals
│       ├── bookdown
│       └── pdf
└── src
    ├── lectures
    │   ├── contents
    │   └── images
    ├── practicals
    │   ├── contents
    │   ├── images
    │   └── materials
    └── utils
        ├── IOSlides
        └── RMarkdown
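After re-making the materials, one quick sanity check is to confirm that the output folders named above exist and contain files. The sketch below does only that. The function name check_outputs is made up for illustration, and the folder list is taken from the paragraph and directory tree above; it says nothing about whether the compiled files themselves are correct.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name): report whether each compiled-output
# folder mentioned in this README exists and is non-empty.

check_outputs() {
  repo_dir="${1:-.}"    # repository folder, defaults to the current directory
  status=0
  for dir in \
    docs/lectures/html \
    docs/lectures/bookdown \
    docs/practicals/pdf \
    docs/practicals/bookdown
  do
    path="${repo_dir}/${dir}"
    # A folder counts as present only if it exists and lists at least one entry.
    if [ -d "${path}" ] && [ -n "$(ls -A "${path}" 2>/dev/null)" ]; then
      echo "ok       ${dir}"
    else
      echo "missing  ${dir}"
      status=1
    fi
  done
  return "${status}"    # non-zero if any folder is missing or empty
}
```

Running check_outputs from the repository folder prints one line per output folder and returns a non-zero status if any of them is missing or empty.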

You can edit the materials in the granolarr repository folder using RStudio or another editor on your computer and then compile the new materials using Docker. Alternatively, you can follow the learner instructions below to start RStudio Server using Docker, and develop your materials in the same environment in which they will be compiled. The first option might be quicker for minor edits, whereas the latter option might be preferable for substantial modifications, and especially when you might need to test your code.

Learner

As a learner, you can use Docker to follow the practical session instructions and complete the exercises. First install Docker on your system, install Git if not already installed, and clone this repository from GitHub.

You can then either build the sdesabbata/granolarr image by running the Docker_Build.sh script in the root directory of the repository or simply pull the latest sdesabbata/granolarr image from the Docker Hub. You should now have all the code and the computational environment needed to work through these materials. To start that environment, run the script Docker_RStudio_Start.sh (Docker_RStudio_Start_WinPowerShell.sh on Windows using PowerShell) from the repository folder.

For instance, on a Unix-based system like Linux or macOS, you can set up and start the granolarr container using the following four commands:

docker pull sdesabbata/granolarr:latest
git clone https://github.com/sdesabbata/granolarr.git
cd granolarr
./Docker_RStudio_Start.sh

The Docker_RStudio_Start.sh script will first create a my_granolarr folder in the parent directory of the root directory of the repository (if it doesn't exist). The script will then instantiate a Docker container for the sdesabbata/granolarr image, bind mount the my_granolarr folder and the granolarr repository folder to the container and start an RStudio Server.

Using your browser, you can access the RStudio Server running in the Docker container by typing 127.0.0.1:28787 in your address bar, and using rstudio as username and rstudio as password. As the my_granolarr folder is bind mounted, everything that you save in the my_granolarr folder in your home directory on RStudio Server will be saved on your computer. Everything else will be lost when the Docker container is stopped.

To stop the Docker container, run the script Docker_RStudio_Stop.sh (same on Windows using PowerShell) from the repository folder.
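The start and stop steps described above can be sketched roughly as follows. This is an illustration rather than the content of Docker_RStudio_Start.sh or Docker_RStudio_Stop.sh. The function names, the container name granolarr, the container port 8787, the PASSWORD environment variable (both conventions of rocker-style RStudio images) and the paths under /home/rstudio are assumptions. Only the host port 28787, the rstudio password and the location of my_granolarr come from this README. By default the functions print the command; setting GRANOLARR_RUN=1 executes it.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the repository's start and stop scripts.
# Assumptions: RStudio Server listens on port 8787 inside the container,
# the image accepts a PASSWORD variable, and folders mount under /home/rstudio.

granolarr_rstudio_start() {
  repo_dir="${1:-$(pwd)}"                            # repository folder on the host
  work_dir="$(dirname "${repo_dir}")/my_granolarr"   # sibling of the repository
  mkdir -p "${work_dir}"                             # created only if missing

  # -d runs the container in the background, -p maps host port 28787 to the
  # assumed container port, and each -v bind mounts one host folder.
  set -- docker run -d --rm --name granolarr \
    -p 28787:8787 \
    -e PASSWORD=rstudio \
    -v "${work_dir}:/home/rstudio/my_granolarr" \
    -v "${repo_dir}:/home/rstudio/granolarr" \
    sdesabbata/granolarr:latest

  if [ "${GRANOLARR_RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

granolarr_rstudio_stop() {
  # Stopping the container discards everything outside the mounted folders.
  set -- docker stop granolarr
  if [ "${GRANOLARR_RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}
```

The bind mount of my_granolarr is what makes saved work persist: files written to that folder inside the container are written to the host folder, whereas the rest of the container's file system is removed when the container stops.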

License and acknowledgements

Stefano De Sabbata

This work is licensed under the GNU General Public License v3.0 except where specified. Contains public sector information licensed under the Open Government Licence v3.0, see data/README.md. See src/lectures/images/README.md, src/practicals/images/README.md and src/utils/IOSlides/README.md for information regarding the images used in the materials.

This repository includes teaching materials that were created by Dr Stefano De Sabbata for the module GY7702 R for Data Science, while working at the School of Geography, Geology, and the Environment of the University of Leicester. Stefano would also like to acknowledge the contributions made to parts of these materials by Prof Chris Brunsdon and Prof Lex Comber (see also An Introduction to R for Spatial Analysis and Mapping, Sage, 2015), Dr Marc Padilla, and Dr Nick Tate, who convened previous versions of the module (GY7022) at the University of Leicester.

Files in the Data folder have been derived from data by sources such as the Office for National Statistics, Ministry of Housing, Communities & Local Government, Ofcom, and other institutions of the UK Government under the Open Government Licence v3 (see linked webpage above on the National Archives website or the LICENSE file in this folder).

This content was created using R, RStudio, RMarkdown, Bookdown, and GitHub.