Welcome to the Wisconsin Dropout Early Warning System (DEWS). DEWS is a machine learning application written in the R statistical computing language and built on top of the DPI data warehouse. DEWS is designed as a flexible, semi-automated system that evaluates dozens of candidate machine learning algorithms for predicting dropout and selects the highest-performing model to assign risk scores to current students. Many of the details behind DEWS have been discussed elsewhere, including on the DPI website: http://www.dpi.wi.gov/dews. The technical details of the machine learning approach are discussed in Knowles (2015).
This document describes the DEWS program itself. DEWS is, by design, a modular application, giving it the flexibility to adapt as DPI's data change, new measures become available, and new machine learning techniques are developed. This modularity consists of four main subroutines (Knowles 2015), preceded by an environment preparation step:
- Prepare the Environment
- Acquire Data
- Transform Data
- Train Models
- Score Cases
Each subroutine comprises a number of steps, and in most cases DEWS contains a custom R function to perform each step. This document describes that workflow and highlights the key functions involved.
DEWS is designed to work from a clean install of R using the `DEWSbootstrap` function. DEWS uses `packrat` from RStudio to freeze the versions of all package dependencies from the last known working run, ensuring that results can be repeated and that changes in package dependencies will not break DEWS. DEWS has a large number of dependencies (80+), largely because many packages each provide a single algorithm. The `packrat` functionality ensures that changes in these packages do not break the ability of DEWS models to run or to recreate predictions. Packages can be updated once they are demonstrated not to cause a breaking change in the DEWS workflow. This also allows DEWS to rely on dependencies that are not available on CRAN, such as the core DEWS package, `EWStools`, which is currently only available on GitHub.
To start DEWS on a new machine with a fresh installation of R, the user needs only to call:

```r
packrat::restore()
library(DEWS)
DEWSbootstrap()
```
Data acquisition is the least flexible part of the program and the piece most tightly integrated with DPI's specific data systems. Changes in the data systems or underlying data will require modifying the functions described here.
Access means authenticating with each of the data systems necessary to obtain the data desired for use in the prediction. In the DPI case, access has been consolidated to a single data warehouse instance, so only one authentication token is required. This is handled by a custom internal DPI R package, `dpiR`, which helps users authenticate to DPI's myriad data systems from within R.
A key to a successful early warning system is consistent data over many years of student observations. The data warehouse handles the storage and cleaning of much of the data, so the project begins with a query against the data warehouse. DEWS uses a nested query system that queries each of several data domains separately and then wraps those queries in a master query that resolves the records into student-year observations and normalizes some inconsistencies. These functions are contained in `R/dataExtract.R`. The key function is `assembleCohort`, which grabs all of the student-level data for a cohort of students. A cohort is defined as all students in a given grade/school-year combination. Students are then followed to graduation, regardless of retention or early promotion. Within these functions are the business rules for assigning a student as a graduate or as a non-completer.
In addition to acquiring the data and enforcing business rules, these functions also create aggregate measures of the peers in the cohort. Thus, school-grade-year aggregates (the percentage of students economically disadvantaged, on an IEP, or suspended, along with mean attendance and test scores) are attached to the student-level data. These prove useful in model training by allowing contextual factors to influence student probabilities.
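As a minimal sketch of how such peer measures can be computed and attached, assuming a hypothetical student-level extract with invented column names:

```r
library(dplyr)

# Hypothetical student-level extract standing in for the warehouse data
students <- data.frame(
  student_id  = 1:6,
  school_id   = c(1, 1, 1, 2, 2, 2),
  grade       = 7,
  school_year = 2014,
  econ_disadv = c(1, 0, 1, 0, 0, 1),
  attendance  = c(0.95, 0.88, 0.99, 0.91, 0.97, 0.85)
)

# School-grade-year aggregates of the peer group
peer_stats <- students %>%
  group_by(school_id, grade, school_year) %>%
  summarize(pct_econ_disadv = mean(econ_disadv),
            mean_attendance = mean(attendance))

# Attach the peer measures back onto each student's row
students <- left_join(students, peer_stats,
                      by = c("school_id", "grade", "school_year"))
```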
Currently DEWS does not contain any imputation routines; this functionality will be added in the next release.
Combining data from all of the various data sources is handled in the `assembleCohort` function, which combines the results of the `pullDemog`, `pullWSAS`, `pullDiscipline`, and `pullMobility` functions, together with `peerStats`, the function that computes the peer measures, into a single R dataframe.
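A conceptual sketch of that assembly; the `pull*` functions are DEWS internals, and the argument and merge-key names shown here are assumptions:

```r
assembleCohortSketch <- function(grade, school_year) {
  # Each pull* function queries one domain of the warehouse
  pieces <- list(
    pullDemog(grade, school_year),
    pullWSAS(grade, school_year),
    pullDiscipline(grade, school_year),
    pullMobility(grade, school_year)
  )
  # Resolve all domains into one student-year observation per row
  out <- Reduce(function(x, y)
    merge(x, y, by = c("student_id", "school_year"), all.x = TRUE), pieces)
  # Attach the school-grade-year peer aggregates
  merge(out, peerStats(out),
        by = c("school_id", "grade", "school_year"), all.x = TRUE)
}
```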
To make the model training process as efficient as possible, the data must be transformed into a shape most easily consumed by the machine learning routines later in the analysis pipeline.
The data is reshaped into wide format, with each row representing a student and each column representing a measure of that student. In the DEWS case the measures currently all come from a single year, but there is no reason multiple years of measures could not be included. The final column is the outcome variable, a binary indicator of the student's on-time graduation status. Where available, multiple cohorts are stacked together.
The function that handles most of this workload is `buildTrainingPool`, which calls `assembleCohort` for all grade-N cohorts for which graduation data is available, combines those cohorts, and then runs a function, `cleanTrainingPool`, to enforce data cleaning and business rules on the result. The cleaning function enforces a host of business rules, such as collapsing some categorical variables into fewer categories, dealing with missing data, and changing the labels of some categories.
The cleaned and stacked data is then reshaped into a model matrix. This is done using the `model.matrix` and `formula` methods in R to expand all categorical variables into binary indicator variables and to apply polynomial transformations to continuous variables where appropriate. The `gatherData` function calls `buildTrainingPool` and then transforms the training dataset into a model matrix suitable for prediction.
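For illustration, a minimal example of this expansion using toy data with invented variable names:

```r
# Toy data with invented variable names
df <- data.frame(
  gender      = factor(c("M", "F", "F", "M")),
  econ_disadv = factor(c("Y", "N", "Y", "N")),
  attendance  = c(0.95, 0.88, 0.99, 0.91),
  math_score  = c(450, 520, 480, 505)
)

# Expand the factors into binary indicators and add a quadratic
# attendance term via a formula
X <- model.matrix(~ gender + econ_disadv + poly(attendance, 2) + math_score,
                  data = df)
```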
For ease of workflow later on, the data is stored as nested `list` objects in R. The list is divided into training, test, and validation data. Each of these lists contains a `pred` object, the model matrix of all the predictor variables, and a `class` object, a vector of the dependent variable stored as a two-level factor. The conversion to a list is handled by the `EWStools::assembleData` function, a convenience function drawn from the open-source core library of DEWS available at www.github.com/jknowles/EWStools.
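A hand-built illustration of this layout, using randomly generated stand-in data (`EWStools::assembleData` performs the real construction, and the element names here are illustrative):

```r
set.seed(3742)
X <- matrix(rnorm(200), nrow = 50)  # stand-in model matrix of predictors
colnames(X) <- paste0("x", 1:4)
y <- factor(sample(c("Grad", "Non.Grad"), 50, replace = TRUE))
split <- sample(1:3, 50, replace = TRUE, prob = c(0.6, 0.2, 0.2))

modelData <- list(
  train = list(pred = X[split == 1, ], class = y[split == 1]),
  test  = list(pred = X[split == 2, ], class = y[split == 2]),
  valid = list(pred = X[split == 3, ], class = y[split == 3])
)
```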
An optional step before storing the predictor matrices as lists is to preprocess the predictors. The most common operations are centering and scaling the predictors by subtracting the mean and dividing by the standard deviation. DEWS does this using the `preProcess.DEWSList` function, which centers and scales the training, test, and validation data all relative to the values in the training data. For binary variables, only the mean is subtracted; for continuous variables, standard centering and scaling is done using the `preProcess` method in the `caret` package. The `preProcess` object used to apply the pre-processing is saved so that variable values can be converted back to their original scale post-prediction.
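A minimal sketch of the underlying `caret` operations, continuing with the `modelData` object sketched above (the real work in DEWS is done by `preProcess.DEWSList`):

```r
library(caret)

# Fit the centering/scaling on the training predictors only...
pp <- preProcess(modelData$train$pred, method = c("center", "scale"))

# ...then apply it to all three splits so everything shares the training scale
modelData$train$pred <- predict(pp, modelData$train$pred)
modelData$test$pred  <- predict(pp, modelData$test$pred)
modelData$valid$pred <- predict(pp, modelData$valid$pred)

# Keep `pp`; it is needed later to map values back to their original scale
```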
All of the steps above are in service of creating a clean and well-organized dataset that can be systematically analyzed by a number of different predictive methods. The functions here form the heart of the early warning system, and many of them have previously been released in the `EWStools` package. DEWS provides an example of a specific implementation of these more general tools, one that respects the peculiarities and limits of the local data.
The functions in DEWS automate the parallelization of model fitting and model construction; the number of CPUs available is specified in the `DEWScontrol` function. The parallelization customizations are platform agnostic and should work in any Windows, Mac, or Linux environment with multiple CPUs and sufficient RAM.
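A minimal, platform-portable sketch of registering a parallel backend for `caret` using the `doParallel` package; the CPU count shown is a stand-in for the value passed through `DEWScontrol`:

```r
library(doParallel)

n_cpus <- 4  # assumption: set from DEWScontrol in the real workflow
cl <- makeCluster(n_cpus)  # PSOCK cluster, works on Windows, Mac, and Linux
registerDoParallel(cl)

# ... caret::train() calls now run their resamples in parallel ...

stopCluster(cl)
```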
The `caret` package provides dozens of potential algorithms for prediction. Not all of these algorithms are appropriate in the DEWS context, and not all of them are feasible to fit. Limitations on specific algorithms include the size of the data (some algorithms are not efficient with very large datasets), the environment in which the models are run (some algorithms require access to external libraries such as a compiler, Java, or other outside software), hardware restrictions (depending on data size, upwards of 32GB of RAM may be necessary), and data structure (some algorithms do not handle highly correlated predictors, repeated measures, or sparse predictor data for rare categories). Depending on the work done in the prior steps, the set of available algorithms will shrink. In Wisconsin's case, approximately 30 algorithms have proven feasible to fit and suitable to the data.
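One way to enumerate `caret`'s methods before winnowing them down to the feasible set; the exclusion list here is purely illustrative:

```r
library(caret)

# All methods caret knows about
all_methods <- unique(modelLookup()$model)

# Drop methods known to be infeasible in a given environment; this
# exclusion list is illustrative only
infeasible <- c("extraTrees", "gbm_h2o")  # e.g., require Java or outside software
candidates <- setdiff(all_methods, infeasible)
```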
To conduct a model search, we first run the `DEWScontrol` function, which defines the control parameters for the search. This function allows the user to decide which accuracy metric to optimize, how to cross-validate models, how many CPUs to use, whether to upsample the data to improve class balance, and other features. Currently `DEWScontrol` defines three specific environments for the Wisconsin implementation, but it can be modified to accommodate any number of parameters.
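A sketch of the kind of `caret` `trainControl` object that `DEWScontrol` presumably constructs; the specific settings shown are assumptions, not the Wisconsin configuration:

```r
library(caret)

ctrl <- trainControl(
  method          = "cv",             # k-fold cross-validation
  number          = 5,
  classProbs      = TRUE,             # class probabilities, needed for ROC
  summaryFunction = twoClassSummary,  # optimize ROC/AUC rather than accuracy
  sampling        = "up"              # upsample to improve class balance
)
```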
The workhorse function is `DEWS_search`, a Wisconsin-specific wrapper around the `EWStools::modelSearch` function. This function automates the process of training an algorithm on the training data, assessing its accuracy on the test data, and storing the results, for a series of user-specified methods available in the `caret` package. The result is a dataframe that stores the name of each method, its accuracy statistics, and a measure of how long the model takes to train. This dataframe is used to identify the most promising models.
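A stripped-down sketch of this search pattern (the real work is done by `EWStools::modelSearch`; the helper function, argument names, and method list below are illustrative):

```r
library(caret)
library(pROC)

searchMethods <- function(methods, data, ctrl) {
  do.call(rbind, lapply(methods, function(m) {
    t0  <- Sys.time()
    fit <- train(x = data$train$pred, y = data$train$class,
                 method = m, metric = "ROC", trControl = ctrl)
    elapsed <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
    # Assess on the held-out test split
    probs   <- predict(fit, newdata = data$test$pred, type = "prob")[, 1]
    testAUC <- as.numeric(auc(roc(data$test$class, probs)))
    data.frame(method = m, testAUC = testAUC, elapsed = elapsed)
  }))
}

results <- searchMethods(c("glm", "rpart", "nnet"), modelData, ctrl)
```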
There is no "best" model. In the Wisconsin system, models are chosen predominantly for how well they minimize the distance from a perfect prediction, but careful consideration is also given to the length of time each model takes to run. Instead of doing this ad hoc, the user defines a function that identifies models based on user-established criteria. The `constructModels` function takes care of this, then creates ensembled versions of these models and caches them to disk. Currently, the function chooses the N most efficient models, where efficiency is the ratio of a model's AUC to the square root of its running time. Additionally, the N most dissimilar models are chosen, in order to compare an ensemble selected specifically for algorithmic diversity.
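The efficiency criterion is simple to express against the search-results dataframe; the column names below follow the earlier sketch and are assumptions:

```r
# Efficiency: AUC relative to the square root of training time
results$efficiency <- results$testAUC / sqrt(results$elapsed)

# The N most efficient methods (N = 2 here for illustration)
topMethods <- head(results[order(-results$efficiency), "method"], 2)
```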
`constructModels` also takes care of ensembling the models, using the `caretEnsemble` package. For both the dissimilar and the most efficient models, two specific ensembles are created: a greedy optimization algorithm and a `caretStack`, which fits a random forest metamodel to the predictions of each of the component models.
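A hedged sketch of these two ensembling steps with `caretEnsemble`; the component method list is illustrative:

```r
library(caret)
library(caretEnsemble)

# caretList needs saved resample predictions to build ensembles
ens_ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                         summaryFunction = twoClassSummary,
                         savePredictions = "final")

model_list <- caretList(x = modelData$train$pred, y = modelData$train$class,
                        trControl = ens_ctrl, metric = "ROC",
                        methodList = c("glm", "rpart", "nnet"))

greedy_ens <- caretEnsemble(model_list)             # weighted blend of the models
rf_stack   <- caretStack(model_list, method = "rf") # random forest metamodel
```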
Modifying the `constructModels` function allows the user to change the way models are selected and the way ensembles are created.
Models must be stored for later scoring. This requires a plan for storing and distinguishing the specific model that will be needed for prediction later. The `constructModels` function stores the ensembled models along with a validation data set and the preProcessing object needed to map predictors back to their original scales. These are stored in a binary `.rda` file, named programmatically based on the date of the run and the cohorts included. Alternatives exist and can be implemented by modifying this function.
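A sketch of the storage step under such a naming scheme; all object and file names are illustrative:

```r
validData <- modelData$valid
fileName  <- sprintf("DEWS_grade%02d_%s.rda", 5, format(Sys.Date(), "%Y%m%d"))
save(greedy_ens, rf_stack, pp, validData, file = fileName)
```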
Model scoring is the point of the whole system: taking what the models have 'learned' from the historical cohorts and assessing the likeliest outcome for students currently in those grades. In Wisconsin we score students based on their most recent available grade level of data and provide this information ahead of the next grade. So we score a 6th grader in September using their 5th grade data (and, consequently, the 5th grade model).
It is also important to have a defined output target in mind. In the DPI case we write out a `.csv` file whose specification is defined by the IT department, to make the process of ingesting the data back into the data warehouse as seamless as possible. Other, more sophisticated and streamlined approaches are possible, including exposing the model as an API in a webservice, scoring the student records in the database, or granting a special server instance direct write access to the database and outputting results directly to a staging table.
Any of these approaches will require a similar set of steps.
The storage of the prediction models is critical, and the naming scheme for retrieving the correct model needs to be clear.
DEWS includes several functions to extract and reshape data on current students so that the data is compatible with the stored model objects for prediction. The test scores must have the same distribution and be on the same scale; attendance, discipline, and demographic categories and values must match as well. This can be complex when there is a long lag between the cohort used to build the models and the scoring cohort, i.e., the lower the grade level being scored, the longer the lag. DEWS includes a few intermediary functions to assist in this alignment, but depending on the history of measures available in a data warehouse, more rigorous equating and rescaling functions may be necessary.
Finally, the predictions must be tested to ensure they are ready for integration and then stored in a format suitable for loading back into the data warehouse.
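Putting the scoring steps together, a hypothetical end-to-end sketch (the `getCurrentStudents` extract function and all file and column names are invented for illustration):

```r
# Restore the stored ensembles, preProcess object, and validation data
load("DEWS_grade05_20150901.rda")

current <- getCurrentStudents(grade = 5)  # hypothetical extract function
scored  <- predict(pp, current$pred)      # rescale to the training scale
probs   <- predict(rf_stack, newdata = scored, type = "prob")

out <- data.frame(student_id = current$student_id, risk_score = probs)
write.csv(out, file = "DEWS_scores_2015.csv", row.names = FALSE)
```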