OpenCaseStudies

Important links

Disclaimer

The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts.

License

This case study is part of the OpenCaseStudies project. This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.

Citation

To cite this case study please use:

Wright, Carrie and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-air-pollution. Predicting Annual Air Pollution (Version v1.0.0).

Acknowledgments

We would like to acknowledge Roger Peng, Megan Latshaw, and Kirsten Koehler for assisting in framing the major direction of the case study.

We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.

Title

Predicting Annual Air Pollution

Motivation

Machine learning methods have been used to predict air pollution levels when traditional monitoring systems are not available in a particular area or when there is not enough spatial granularity with current monitoring systems.

We will use machine learning methods to predict annual air pollution levels spatially within the US based on data about population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data.

Motivating question

  1. Can we predict annual average air pollution concentrations at the granularity of zip code regions using predictors such as population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data?

Data

The data that we will use in this case study come from a gravimetric air pollution monitoring system operated by the US Environmental Protection Agency (EPA) that measures fine particulate matter (PM2.5) in the United States (US). We will use data from 876 gravimetric monitors in the contiguous US in 2008.

Roughly 90% of these monitors are located within cities.

Hence, there is an equity issue in terms of capturing the air pollution levels of more rural areas. To get a better sense of the pollution exposures for the individuals living in these areas, methods like machine learning can be useful to estimate or predict air pollution levels in areas with little to no monitoring.

We will use data related to population density, urbanization, and road density, as well as NASA satellite pollution data and chemical modeling data, to predict the monitoring values captured from this air pollution monitoring system.

The data for these 48 predictors come from the US Environmental Protection Agency (EPA), the National Aeronautics and Space Administration (NASA), the US Census, and the National Center for Health Statistics (NCHS).

All of our data was previously collected by a researcher at the Johns Hopkins School of Public Health who studies air pollution and climate change.

Learning Objectives

The skills, methods, and concepts that students will be familiar with by the end of this case study are:

Data Science Learning Objectives:

  1. Familiarity with the tidymodels ecosystem
  2. Ability to evaluate correlation among predictor variables (corrplot and GGally)
  3. Ability to implement tidymodels packages such as rsample to split the data into training and testing sets as well as cross validation sets.
  4. Ability to use the recipes, parsnip, and workflows packages to train and test a linear regression model and a random forest model
  5. Ability to visualize geo-spatial data using ggplot2

Statistical Learning Objectives:

  1. Basic understanding of the utility of machine learning for prediction and classification
  2. Understanding of the need for training and test sets
  3. Understanding of the utility of cross validation
  4. Understanding of random forest
  5. Ability to interpret root mean squared error (RMSE) to assess performance for prediction
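
As a minimal sketch of how RMSE is computed and read, the base R snippet below uses made-up observed and predicted PM2.5 values (they are not taken from the case study data). RMSE is the square root of the mean squared difference between observations and predictions, so it is expressed in the same units as the outcome.

```r
# Hypothetical observed and predicted annual PM2.5 values (in µg/m^3).
# These numbers are illustrative only and do not come from the case study.
observed  <- c(10.0, 12.0, 8.0, 11.0)
predicted <- c(11.0, 10.0, 8.0, 13.0)

# RMSE: square root of the mean of the squared prediction errors
rmse_value <- sqrt(mean((observed - predicted)^2))
rmse_value
#> [1] 1.5
```

Here an RMSE of 1.5 says that predictions are typically off by about 1.5 µg/m³, with larger errors weighted more heavily than small ones. The yardstick package provides an rmse() function that performs the same calculation on a data frame of predictions.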

Analysis

This case study focuses on machine learning methods. We demonstrate how to train and test a linear regression model and a random forest model.
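
The overall shape of such an analysis can be sketched as follows. This is an illustration under stated assumptions, not the code used in the case study: the data are simulated, and the column names (pop_density, road_density, value) and the normalization step are chosen only for the example.

```r
library(dplyr)
library(rsample)
library(recipes)
library(parsnip)
library(workflows)
library(yardstick)

set.seed(1234)

# Simulated stand-in for the monitor data (hypothetical columns and values)
toy <- tibble(
  pop_density  = runif(200, 0, 1000),
  road_density = runif(200, 0, 10),
  value        = 8 + 0.002 * pop_density + 0.3 * road_density + rnorm(200)
)

# Split into training and testing sets
split <- initial_split(toy, prop = 2/3)
train <- training(split)
test  <- testing(split)

# Pre-processing recipe: center and scale the numeric predictors
rec <- recipe(value ~ ., data = train) %>%
  step_normalize(all_numeric_predictors())

# Model specification: ordinary linear regression
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Bundle recipe and model, then fit on the training set only
wf     <- workflow() %>% add_recipe(rec) %>% add_model(lm_spec)
wf_fit <- fit(wf, data = train)

# Predict on the held-out testing set and compute RMSE
preds <- predict(wf_fit, new_data = test) %>% bind_cols(test)
rmse(preds, truth = value, estimate = .pred)
```

Because the recipe and the model live in one workflow, a random forest can be tried by replacing lm_spec with a specification built from rand_forest() and set_engine("randomForest"), leaving the rest of the sketch unchanged.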

Data import

The data is imported from a CSV file using the readr package.

Data wrangling

This case study does not demonstrate very many data wrangling methods. However, we do cover the mutate() and across() functions of the dplyr package in the Data wrangling section. In the Data visualization section, some wrangling was required, including combining data using the inner_join() function of the dplyr package, using the separate() function of the tidyr package to make two columns out of one, and using the str_to_title() function of the stringr package to change the format of some character strings.
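
To give a feel for how these functions fit together, here is a small sketch on invented data. The two tables, their column names, and all values are hypothetical and only serve to show each function in action.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical monitor table (made-up names and values)
monitors <- tibble(
  id       = c(1, 2),
  location = c("maryland,baltimore city", "ohio,franklin"),
  value    = c(11.2, 12.5)
)

# Hypothetical lookup table to join against (made-up populations)
county_pop <- tibble(
  county = c("Baltimore City", "Franklin"),
  pop    = c(600000, 1300000)
)

wrangled <- monitors %>%
  # mutate() + across(): apply one function to selected columns
  mutate(across(c(id), as.factor)) %>%
  # separate(): make two columns out of one
  separate(location, into = c("state", "county"), sep = ",") %>%
  # str_to_title(): "baltimore city" becomes "Baltimore City"
  mutate(county = str_to_title(county)) %>%
  # inner_join(): keep only rows with a match in both tables
  inner_join(county_pop, by = "county")

wrangled
```

The title-casing step matters here: without it the county names would not match the lookup table and inner_join() would return no rows.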

Data exploration

We demonstrate how to get a summary of a relatively large set of predictors using the skimr package, as well as how to evaluate correlation among all variables using the corrplot package and among specific variables with more information using the GGally package.
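
A rough sketch of this exploration pattern is shown below on simulated data. The variable names are hypothetical, and two of the variables are deliberately constructed to be strongly correlated so that the plots have something to show.

```r
library(skimr)
library(corrplot)
library(GGally)

set.seed(1234)

# Simulated predictors (hypothetical names); urban_index is built from
# pop_density so that the two are strongly correlated
pop_density <- runif(100, 0, 1000)
toy <- data.frame(
  pop_density  = pop_density,
  urban_index  = pop_density / 100 + rnorm(100, sd = 0.5),
  road_density = runif(100, 0, 10)
)

# Overview of every variable: missingness, quantiles, small histograms
skim(toy)

# Correlation among all variables at once
cor_mat <- cor(toy)
corrplot(cor_mat, tl.cex = 0.8)

# Closer look at specific variables: scatterplots, densities, correlations
ggpairs(toy, columns = c("pop_density", "urban_index"))
```

With dozens of predictors, the corrplot() overview helps spot groups of related variables, and ggpairs() is then practical for a handful of them at a time.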

Statistical concepts

We cover the basics of machine learning: 1) the difference between prediction and classification, 2) the importance of training and testing, 3) the concept of cross validation and tuning, and 4) how random forest works.
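
As an illustration of training/testing splits and cross validation, the sketch below holds out a testing set, divides the training set into folds, and estimates performance across those folds. The data are simulated, and the number of folds and the variable names are assumptions made for the example rather than choices taken from the case study.

```r
library(dplyr)
library(rsample)
library(parsnip)
library(workflows)
library(tune)

set.seed(1234)

# Simulated data (hypothetical predictors x1 and x2, outcome value)
toy <- tibble(
  x1    = rnorm(150),
  x2    = rnorm(150),
  value = 10 + 2 * x1 - x2 + rnorm(150)
)

# Hold out a testing set; it is not touched during cross validation
split <- initial_split(toy, prop = 2/3)
train <- training(split)

# Divide the training set into 4 folds
folds <- vfold_cv(train, v = 4)

# A simple linear regression workflow
wf <- workflow() %>%
  add_formula(value ~ x1 + x2) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# Fit on each fold's analysis portion, evaluate on its assessment portion
cv_results <- fit_resamples(wf, resamples = folds)

# Performance metrics averaged over the folds
collect_metrics(cv_results)
```

Each fold gives one estimate of out-of-sample error, and collect_metrics() averages them. The same resampling object can be passed to the tuning functions of the tune package when a model has hyper-parameters to choose.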

Other notes and resources

  1. A review of tidymodels
  2. A course on tidymodels by Julia Silge
  3. More examples, explanations, and info about tidymodels development from the developers
  4. A guide for pre-processing with recipes
  5. A guide for using GGally to create correlation plots
  6. A guide for using parsnip to try different algorithms or engines
  7. A list of recipe functions
  8. A great blog post about cross validation
  9. A discussion about evaluating model performance, for a deeper explanation of the topic
  10. RStudio cheatsheets
  11. An explanation of supervised vs unsupervised machine learning and bias-variance trade-off.
  12. A thorough explanation of principal component analysis.
  13. If you have access, this is a great discussion about the difference between independence, orthogonality, and lack of correlation.
  14. Great video explanation of PCA.

Terms and concepts covered:

Tidyverse
Imputation
Transformation
Discretization
Dummy Variables
One Hot Encoding
Data Type Conversions
Interaction
Normalization
Dimensionality Reduction/Signal Extraction
Row Operations
Near Zero Variance
Parameters and Hyper-parameters
Supervised and Unsupervised Learning
Principal Component Analysis
Linear Combinations
Decision Tree
Random Forest

Packages used in this case study:

Package | Use in this case study
--- | ---
here | to easily load and save data
readr | to import the CSV file data
dplyr | to view/arrange/filter/select/compare specific subsets of the data
skimr | to get an overview of data
summarytools | to get an overview of data in a different style
magrittr | to use the %<>% piping operator
corrplot | to make large correlation plots
GGally | to make smaller correlation plots
rsample | to split the data into testing and training sets and to split the training set for cross-validation
recipes | to pre-process data for modeling in a tidy and reproducible way and to extract pre-processed data (major functions are recipe(), prep(), and various transformation step_*() functions, as well as bake(), which extracts pre-processed training data (this used to require juice()) and applies recipe preprocessing steps to testing data). See here for more info.
parsnip | an interface to create models (major functions are fit(), set_engine())
yardstick | to evaluate the performance of models
broom | to get tidy output for our model fit and performance
ggplot2 | to make visualizations with multiple layers
dials | to specify hyper-parameter tuning
tune | to perform cross validation, tune hyper-parameters, and get performance metrics
workflows | to create a modeling workflow to streamline the modeling process
vip | to create variable importance plots
randomForest | to perform the random forest analysis
doParallel | to fit cross validation samples in parallel
stringr | to manipulate the text in the map data
tidyr | to separate data within a column into multiple columns
rnaturalearth | to get the geometry data for the earth to plot the US
maps | to get map database data about counties to draw them on our US map
sf | to convert the map data into a data frame
lwgeom | to use the sf function to convert the map geographical data
rgeos | to use geometry data
patchwork | to allow plots to be combined

For users

There is a Makefile in this folder. Typing make knits the case study contained in index.Rmd to index.html and also knits README.Rmd to a markdown file (README.md).

For instructors

This case study is intended to introduce fundamental topics in Machine Learning and to introduce how to implement model prediction using the tidymodels ecosystem of packages in R.

Target audience

This case study is intended for those with some familiarity with linear regression and R programming.

Suggested homework

Students can predict air pollution monitor values using a different algorithm and provide an explanation for how that algorithm works and why it may be a good choice for modeling this data.