Mungebits2 defines a way of thinking about data preparation that couples the definition of what happens in batch processing versus online prediction so that both can be described by the same codebase. This means you can use R to run your munging in a production system on streaming single rows of data without writing additional code.
This is a re-implementation of mungebits that removes the need for non-standard evaluation, and provides better integration with stageRunner.
The package has full test coverage and documentation, so you are encouraged to peek into the internals.
The idea behind mungebits grew out of a year-long session attempting to productionize R code without translating it into another programming language.
Almost every package that implements a statistical predictor requires the user to provide a wrangled dataset, that is, one stripped of outliers, with correctly coerced types, and an array of other "data preparation" aspects that may affect the final performance of the model.
Consider, for example, making use of a categorical variable that has many unique values, some of which occur commonly and others incredibly rarely. It may improve performance of some classifiers to take the rare values, say those which occur with a frequency of less than 5% in the data set, and label them as the value "OTHER".
The choice of which variables make it into the "OTHER" label is determined by the training set, which may differ across random cross-validation splits and change as an organization gathers more data or the distribution shifts, such as due to a changing consumer base or market conditions.
When one refits a model with the new dataset, it would be ideal if the data preparation automatically reflected the updated values by picking the set of labels that occur with greater than 5% frequency and labeling all others as "OTHER".
In code, we may say that
during_training <- function(factor_column) {
frequencies <- table(factor_column)
most_common <- names(which(frequencies / length(factor_column) > 0.05))
factor_column <- factor(
ifelse(factor_column %in% most_common, factor_column, "OTHER"),
levels = c(most_common, "OTHER")
)
list(new_column = factor_column, most_common = most_common)
}
# Let's create an example variable.
factor_column <- factor(rep(1:20, 1:20))
output <- during_training(factor_column)
factor_column <- output$new_column
# We would hold on to output$most_common and "feed it" to
# munging code that ran in production on single data points.
during_prediction <- function(factor_column, most_common) {
factor(ifelse(factor_column %in% most_common, factor_column, "OTHER"),
levels = c(most_common, "OTHER"))
}
# Notice we have re-used our earlier code for constructing the new
# column. We will have to use the above function for munging in
# production and supply it the list `most_common` levels computed
# earlier during training.
single_data_point <- 5
stopifnot(identical(
during_prediction(5, output$most_common),
factor("OTHER", levels = c(as.character(11:20), "OTHER"))
))
single_data_point <- 15
stopifnot(identical(
during_prediction(15, output$most_common),
factor("15", levels = c(as.character(11:20), "OTHER"))
))
# In a real setting, we would want to operate on full data.frames
# instead of only on atomic vectors.
It may seem silly to create a factor variable with a single value
and a surplus of unused levels, but that is only the case if you
have never tried to productionize your data science models! Remember,
even if you trained a simple regression, your factor columns will need
to be converted to 0/1 columns using something like the model.matrix
helper function, and this will yell at you if the correct levels are not
there on the factor column.
The point of mungebits is to replace all that hard work--which in the experience of the author has sometimes spanned data preparation procedures composed of hundreds of steps like the above for collections of thousands of variables--with the much simplified
# During offline training.
replace_uncommon_levels_mungebit$run(dataset)
The mungebit has now been "trained" and remembers the common_levels
defined earlier. In a production system, we will be able to run the
exact same code on a single row of data, as long as we serialize
the mungebit object and recall it during production. This gives us
a streaming machine learning engine that includes hard data
wrangling work--in R.
# During real-time prediction.
replace_uncommon_levels_mungebit$run(dataset)
After understanding mungebits, data science will stop being data janitor work and you will get back to the math.
For more examples of practical mungebits, see any of the functions exported in the syberiaMungebits repository. Examples include
- Parsing dates
- Discretization
- Dropping sparse variables
- Imputation
- Deriving new variables
- Ordering by a key
- Stripping outliers
- Renaming variables
- Selecting a subset of rows
- Sure independence screening
- Replacing values
- Handling special codes in character data
- And so on...
This package is not yet available from CRAN (as of October 26, 2015). To install the latest development builds directly from GitHub, run this instead:
if (!require("devtools")) install.packages("devtools")
devtools::install_github("syberia/mungebits2")
This project is licensed under the MIT License:
Copyright (c) 2015-2016 Syberia, Avant, Robert Krzyzanowski
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
This package was created by Robert Krzyzanowski, rob@syberia.io. It is based on the original package mungebits by the same author.