Machine Learning in R

Thanks for attending my sessions at ODSC East 2023. This repo will hold the code we write in the Machine Learning in R workshop.

Setup

For this course you need a recent version of R. Anything greater than 4.0 is good, but 4.2 is even better. I also highly recommend using your IDE/code editor of choice. Most people use either RStudio or VS Code with the R language extensions.

After you have R and your favorite editor installed, you should install the packages needed today with the following line of code.

install.packages(c(
  'here', 'markdown', 'rmarkdown', 'knitr', 'tidyverse', 'ggthemes', 'ggridges', 
  'tidymodels', 'coefplot', 'glmnet', 'xgboost', 'vip', 'DiagrammeR', 
  'DBI', 'themis', 'vetiver', 'fable', 'tsibble', 'echarts4r', 'leaflet', 
  'leafgl', 'leafem'
))

Git

If you are comfortable with git, you can clone this repo to get the project structure.

git clone https://github.com/jaredlander/odsceast2023.git

Docker

If you are having trouble installing R or the packages, but are comfortable with Docker, you can pull the Docker image using the following command in your terminal.

docker pull jaredlander/odsceast2023:4.2.3

You can run the container with the following command, which will also mount a folder as a volume for you to use.

docker run -it --rm --name rstudio_ml -e PASSWORD=password -e ROOT=true -p 8787:8787 -v $PWD/workshop:/home/rstudio/workshop jaredlander/odsceast2023:4.2.3

Codespaces

The Docker image should work natively in GitHub Codespaces so you can run a remote instance of VS Code with all the packages ready to go. You can theoretically even launch RStudio from within the VS Code instance, though I haven’t figured that out yet.

Code

Throughout the class I will be pushing code to this repo in case you need to catch up. Most, if not all, of it will be in the code folder.

Workshop Plan

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. A combination of supervised learning (regression-like models) and unsupervised learning (clustering), the field is supported by theory, yet relies upon intelligent programming for implementation.

In this training session we will work through the entire process of training a machine learning model in R. We start with the scaffolding of cross-validation, then move on to exploratory data analysis, feature engineering, model specification, parameter tuning, and model selection. We then take the finished model and deploy it as an API in a Docker container for production use.

We will make extensive use of the tidymodels framework of R packages.

Preparing Data for the Modeling Process

The first step in a modeling project is setting up the evaluation loop in order to properly measure a model’s performance. To accomplish this we will learn the following tasks, with a short code sketch after the list:

  1. Load the data
  2. Create train and test sets from the data using the rsample package
  3. Create a cross-validation set from the train set using the rsample package
  4. Define model evaluation metrics, such as RMSE and log loss, with the yardstick package
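
As a rough sketch of these steps, assume a hypothetical data frame called housing with a numeric outcome price; the data, split proportion, and fold count here are placeholders, not necessarily what we will use in class.

library(rsample)
library(yardstick)

set.seed(42)

# 2. split the data into train and test sets
data_split <- initial_split(housing, prop = 0.8)
train <- training(data_split)
test <- testing(data_split)

# 3. build cross-validation folds from the train set
cv_folds <- vfold_cv(train, v = 5)

# 4. define the metrics used to judge the model; these are regression
# metrics, while log loss would apply to a classification outcome
reg_metrics <- metric_set(rmse, mae)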

EDA and Feature Engineering

Before we can fit a model we must first understand the data by performing exploratory data analysis. After that we prepare the data through feature engineering, also called preprocessing or data munging. The primary steps we will learn, sketched in code after the list, include:

  1. Perform summary EDA with dplyr
  2. Visualize the data with ggplot2
  3. Balance the data with the themis package
  4. Impute or otherwise mark missing data with the recipes package
  5. Perform data transformations with the recipes package
    1. Numeric centering and scaling
    2. Collapse noisy categorical data
    3. Handle new categorical values
    4. Convert categorical data into dummy (or indicator) variables
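
Continuing the hypothetical housing example, the EDA and feature engineering steps could look roughly like this; the column selectors and threshold are illustrative only.

library(dplyr)
library(ggplot2)
library(recipes)

# 1. quick summary EDA of the numeric columns
summarise(train, across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

# 2. visualize the outcome's distribution
ggplot(train, aes(x = price)) + geom_histogram()

# 4-5. imputation and data transformations expressed as a recipe
rec <- recipe(price ~ ., data = train) |>
  # impute missing numeric values with the median
  step_impute_median(all_numeric_predictors()) |>
  # center and scale numeric predictors
  step_normalize(all_numeric_predictors()) |>
  # handle categorical levels not seen during training
  step_novel(all_nominal_predictors()) |>
  # collapse noisy, infrequent categorical levels into an 'other' level
  step_other(all_nominal_predictors(), threshold = 0.05) |>
  # convert categorical data into dummy (indicator) variables
  step_dummy(all_nominal_predictors())

# 3. balancing with themis applies to classification problems, where a
# step such as themis::step_smote() could be added to the recipe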

Model Fitting and Parameter Tuning

Now we can begin fitting models. This involves defining the type of model, such as a penalized regression, random forest or boosted tree, a task simplified thanks to the parsnip and workflows packages. Modern machine learning has essentially become an exercise in brute-forcing over tuning parameters, which we will do by combining the dials and tune packages with the previously created cross-validation set, as sketched after the list.

  1. Define the model structure with the parsnip package
  2. Set tuning parameter candidates with the dials package
  3. Iterate over the tuning parameter candidates using the tune package to perform cross-validation
  4. Identify the best model fit with the yardstick package
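
A rough sketch of this tuning loop, reusing the hypothetical rec, cv_folds, and reg_metrics objects from the earlier sketches; the model type and grid size are illustrative.

library(parsnip)
library(workflows)
library(dials)
library(tune)

# 1. define a penalized regression with tunable penalty and mixture
mod_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine('glmnet')

# combine the preprocessing recipe and the model into a workflow
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(mod_spec)

# 2. set candidate values for the tuning parameters
param_grid <- grid_regular(penalty(), mixture(), levels = 5)

# 3. cross-validate over the candidates
tuned <- tune_grid(wf, resamples = cv_folds, grid = param_grid, metrics = reg_metrics)

# 4. inspect and select the best parameter combination
show_best(tuned, metric = 'rmse')
best_params <- select_best(tuned, metric = 'rmse')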

Deploying the Model into Production

After we build various machine learning models we need to make them accessible to others. We use the plumber package to expose our model as a REST API that can be hosted in a Docker container, as sketched after the list.

  1. Make predictions using the workflows package
  2. Convert the model to an API using the plumber package
  3. Bundle the model object and API code into a Docker container
  4. Serve that container and use curl to make predictions
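
Putting it together, a minimal and entirely hypothetical version of the prediction and API code might look like the following; the file names, route, and port are placeholders.

library(tune)
library(workflows)

# finalize the workflow with the winning parameters and refit on all the train data
final_fit <- finalize_workflow(wf, best_params) |>
  fit(data = train)

# 1. make predictions with the fitted workflow
predict(final_fit, new_data = test)

# save the fitted model so the API code can load it
saveRDS(final_fit, 'model.rds')

The API itself lives in a separate file, here hypothetically called plumber.R, that the Docker container runs.

# plumber.R
library(plumber)

model <- readRDS('model.rds')

#* Return predictions for JSON records posted to the endpoint
#* @post /predict
function(req) {
  new_data <- jsonlite::fromJSON(req$postBody)
  predict(model, new_data = new_data)
}

Once the container serves this file, a curl POST of JSON records to the /predict endpoint returns predictions.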