This repository contains source code for a set of experiments that assess the performance of multiple approaches to model stacking with the stacks package.
The scripts in this repo benchmark the currently implemented meta-learner, a regularized linear model, against a set of proposed alternative meta-learners. They rely on a branch of the stacks package which introduces a `meta_learner` argument to `blend_predictions`, allowing for combining predictions from member models with any modeling workflow. That version of the package can be installed with the following code:

    pak::pak("tidymodels/stacks@general-meta")

Install all of the dependencies needed to run this experiment with `pak::local_install_dev_deps()`.

The `analyses` folder contains a series of scripts that benchmark model stacking with several combinations of datasets and meta-learners. The structure of each sub-folder in `analyses`, for a dataset called `dataset`, is as follows:

- `dataset/`
  - `prepare_dataset.R`: A script that prepares data and fits a series of preprocessors and models to resamples of a dataset. Data splits resulting from this script are saved to `dataset_data.RData`. Model fit objects resulting from this script are saved to the `candidate_fits/` sub-directory.
  - `dataset_data.RData`: Data splits resulting from `prepare_dataset.R`.
  - `candidate_fits/`: A folder containing model fits given some `preproc`essor and `model` on resamples of the `dataset`. Each object is stored as a row of a workflow set, and the rows can be row-bound together to form a fitted `workflow_set` object.
    - `dataset_preproc1_model1.RData`
    - `dataset_preproc1_model2.RData`
    - …
  - `fit_members_dataset.R`: A script that reads in each element of `candidate_fits/` and fits all of them on the entire training set. The needed results from this script can then be dropped into model stacks with fitted meta-learners as “fitted members.” Doing this step separately from the usual stacks pipeline means each base learner is fit on the entire training set only once, rather than once for each unique combination of preprocessor and model used as a meta-learner.
  - `member_fits/`: A folder containing model fits given some `preproc`essor and `model` on the training set of the `dataset`.
    - `dataset_preproc1_model1.RData`
    - `dataset_preproc1_model2.RData`
    - …
  - `blend_scripts/`:
    - `blend_dataset_preproc1_model1.R`: A script that reads in each element of `candidate_fits/`, row-binds them together to form a workflow set, generates a data stack using the workflow set, fits the `preproc`essor and `model` as a meta-learner to the data stack, drops in needed fitted members, and then generates some basic metrics with the fitted model stack. These metrics are saved as `dataset_preproc1_model1.RData` under `metrics/`.
    - `blend_dataset_preproc1_model2.R`
    - …
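
The "read each element and row-bind" step that the member-fitting and blending scripts share can be sketched in base R. This is a minimal illustration, not the repository's code: `read_rdata()` is a hypothetical helper, the one-row data frames are stand-ins for real workflow set rows, and the assumption that each `.RData` file holds exactly one object is mine.

```r
# Hypothetical helper: load the single object stored in an .RData file,
# whatever name it was saved under.
read_rdata <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)
  stopifnot(length(obj_names) == 1)
  get(obj_names, envir = env)
}

# Stand-in data: two one-row data frames saved one per file, mimicking the
# layout of candidate_fits/. Real files would hold workflow set rows.
candidate_dir <- file.path(tempdir(), "candidate_fits")
unlink(candidate_dir, recursive = TRUE)
dir.create(candidate_dir)
fit_1 <- data.frame(wflow_id = "preproc1_model1", stringsAsFactors = FALSE)
fit_2 <- data.frame(wflow_id = "preproc1_model2", stringsAsFactors = FALSE)
save(fit_1, file = file.path(candidate_dir, "dataset_preproc1_model1.RData"))
save(fit_2, file = file.path(candidate_dir, "dataset_preproc1_model2.RData"))

# Read every file in the folder and bind the rows into one object.
paths <- list.files(candidate_dir, pattern = "\\.RData$", full.names = TRUE)
rows <- lapply(paths, read_rdata)
candidates <- do.call(rbind, rows)
```

With real workflow set rows, `dplyr::bind_rows()` would be a likely substitute for `rbind()`; whether either call preserves the `workflow_set` class in this branch is an assumption to verify.
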
The top-level folder `metrics` contains the “output” from each of these experiments, a five-element list with the dataset name, preprocessor and model specification for the meta-learner, time to fit, and test set performance metrics. The files are named in the format `dataset_preproc_model.RData`.

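
As a rough picture of one such output file, the sketch below builds a five-element list and saves it under the naming format described above. The element names (`dataset`, `preproc`, `model`, `time_to_fit`, `metrics`) and the use of character labels for the preprocessor and model are assumptions for illustration; the values are empty placeholders, not results.

```r
# Hypothetical element names: the text above states only what the five
# elements hold, not what they are called.
result <- list(
  dataset = "dataset",
  preproc = "preproc1",
  model = "model1",
  time_to_fit = NA_real_,  # placeholder, no real timing
  metrics = data.frame(.metric = character(), .estimate = numeric())
)

# Build the file name in the dataset_preproc_model.RData format and save.
metrics_dir <- file.path(tempdir(), "metrics")
dir.create(metrics_dir, showWarnings = FALSE)
out_path <- file.path(
  metrics_dir,
  paste0(result$dataset, "_", result$preproc, "_", result$model, ".RData")
)
save(result, file = out_path)

# Reload into a separate environment to inspect the stored list.
loaded <- new.env()
load(out_path, envir = loaded)
names(loaded$result)
```
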
The top-level folder `meta_learners` contains the code used to generate the proposed preprocessors and model specifications.

The naming schemes in these experiments are chosen for straightforward extensibility. Because each stage depends on the output of the previous one, the full experiment runs in three stages, in this order:

- Run all of the data preparation and workflow set fitting scripts (`.R` files starting with `prepare_`)
- Run all of the member fitting scripts (`.R` files starting with `fit_members_`)
- Run all of the blending + benchmarking scripts (`.R` files starting with `blend_`)

The code that I use to run the experiment is in `run.R`.

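The prefix-based naming makes it possible to discover and run scripts by stage. The sketch below shows one way that could look; it is not the contents of `run.R`. `find_scripts()` and `write_stub()` are hypothetical helpers, and the demo uses stub scripts in a temporary folder that only log which stage ran.

```r
# Hypothetical helper: find .R scripts under `root` whose file name starts
# with `prefix`. list.files() matches the pattern against file names only.
find_scripts <- function(root, prefix) {
  list.files(root, pattern = paste0("^", prefix, ".*\\.R$"),
             recursive = TRUE, full.names = TRUE)
}

# Demo layout with stub scripts standing in for the real analyses.
root <- file.path(tempdir(), "analyses_demo")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "dataset", "blend_scripts"), recursive = TRUE)
log_path <- file.path(root, "run_log.txt")

# Each stub appends its stage name to the log when sourced.
write_stub <- function(path, stage) {
  writeLines(
    sprintf("cat(%s, file = %s, append = TRUE)",
            deparse(paste0(stage, "\n")), deparse(log_path)),
    path
  )
}
write_stub(file.path(root, "dataset", "prepare_dataset.R"), "prepare")
write_stub(file.path(root, "dataset", "fit_members_dataset.R"), "fit_members")
write_stub(
  file.path(root, "dataset", "blend_scripts", "blend_dataset_preproc1_model1.R"),
  "blend"
)

# Run the three stages in dependency order.
for (prefix in c("prepare_", "fit_members_", "blend_")) {
  for (script in find_scripts(root, prefix)) {
    source(script)
  }
}
stages_run <- readLines(log_path)
```

The real scripts may depend on the working directory or on objects created by earlier scripts, so a runner for the actual repository would likely need to account for that.
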
| ID | Recipe | Model Spec |
|---|---|---|
| `basic_glmnet` | Minimal | Penalized Linear Regression |
| `basic_xgb` | Minimal | Boosted Tree (via XGBoost) |
| `basic_lgb` | Minimal | Boosted Tree (via LightGBM) |
| `normalize_bt` | Center + Scale | Bagged Tree |
| `normalize_bm` | Center + Scale | Bagged MARS |
| `normalize_svm` | Center + Scale | Support Vector Machine (via RBF) |
| `normalize_nn` | Center + Scale | Multi-layer Perceptron (Neural Network) |
| `pca_bt` | Principal Component Analysis | Bagged Tree |
| `pca_bm` | Principal Component Analysis | Bagged MARS |
| `pca_svm` | Principal Component Analysis | Support Vector Machine (via RBF) |
| `pca_nn` | Principal Component Analysis | Multi-layer Perceptron (Neural Network) |
| `renormalize_svm` | C+S, PCA, C+S | Support Vector Machine (via RBF) |
| `renormalize_nn` | C+S, PCA, C+S | Multi-layer Perceptron (Neural Network) |