Pete Mohanty August 17, 2018

Now on CRAN, kerasformula offers a high-level interface to keras neural nets. kerasformula streamlines everything from data manipulation to model design to cross-validation and hyperparameter selection. kms, as in keras_model_sequential(), is a regression-style function that lets you build keras neural nets with R formula objects. kms() accepts a number of parameters (like the loss function and optimizer), allowing users to customize the number of units, layers, activation function, and so on, and splits the data into (optionally sparse) test and training matrices. kms() facilitates setting advanced hyperparameters (e.g., dropout rate and regularization) to prevent overfitting. kms() optionally accepts a compiled keras_model_sequential(). kms() returns a single object with predictions, a confusion matrix, and function call details.
kms accepts the major parameters found in library(keras) as inputs (loss function, batch size, number of epochs, etc.) and allows users to customize basic neural nets which, by default, now include regularizers. kms also accepts a compiled keras_model_sequential as an argument (preferable for more complex models). The examples here (and in the examples folder) don't provide particularly predictive models so much as show how formula objects can smooth data cleaning and hyperparameter selection.
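As a quick orientation, here is a minimal sketch of what a call and its result look like (it uses the built-in iris data, which is not part of this post's examples):

library(kerasformula)

# categorical outcome: kms() builds, compiles, and fits a dense net from
# the formula, splitting iris into training and test data
out <- kms(Species ~ ., iris, Nepochs = 5, verbose = 0)

out$evaluations$acc       # out-of-sample accuracy
str(out, max.level = 1)   # lists everything kms() returns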
A worked example can be found on the RStudio TensorFlow website here: Twitter data.
kerasformula is now available on CRAN. It assumes that library(keras) is installed and configured.

install.packages("kerasformula")
library(kerasformula)
install_keras()    # see ?install_keras for install options like GPU

To install the development version of kerasformula:

devtools::install_github("rdrr1990/kerasformula")
kms accepts the major parameters found in library(keras) as inputs (loss function, batch size, number of epochs, etc.), allowing users to customize neural nets. kms splits training and test data into (optionally sparse) matrices. kms also auto-detects whether the dependent variable is continuous, categorical, or binary.
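For example (a hypothetical sketch with the built-in mtcars data, not part of the original examples), the same kind of call adapts to the outcome type:

# continuous outcome: detected and fit as a regression-style net
out_mpg <- kms(mpg ~ cyl + wt + hp, mtcars, Nepochs = 5, verbose = 0)

# binary outcome: am is coded 0/1, so it should be detected as binary classification
out_am <- kms(am ~ cyl + wt + hp, mtcars, Nepochs = 5, verbose = 0)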
This document shows how to fit a neural net with kerasformula using a dataset of about 3,000 popular movies hosted on Amazon AWS.
library(kerasformula)
library(ggplot2)
movies <- read.csv("http://s3.amazonaws.com/dcwoods2717/movies.csv")
dplyr::glimpse(movies)
Observations: 2,961
Variables: 11
$ title <fct> Over the Hill to the Poorhouse, The Broadw...
$ genre <fct> Crime, Musical, Comedy, Comedy, Comedy, An...
$ director <fct> Harry F. Millarde, Harry Beaumont, Lloyd B...
$ year <int> 1920, 1929, 1933, 1935, 1936, 1937, 1939, ...
$ duration <int> 110, 100, 89, 81, 87, 83, 102, 226, 88, 14...
$ gross <int> 3000000, 2808000, 2300000, 3000000, 163245...
$ budget <int> 100000, 379000, 439000, 609000, 1500000, 2...
$ cast_facebook_likes <int> 4, 109, 995, 824, 352, 229, 2509, 1862, 11...
$ votes <int> 5, 4546, 7921, 13269, 143086, 133348, 2918...
$ reviews <int> 2, 107, 162, 164, 331, 349, 746, 863, 252,...
$ rating <dbl> 4.8, 6.3, 7.7, 7.8, 8.6, 7.7, 8.1, 8.2, 7....
How the data are cleaned affects overfitting (models that do relatively well on training data compared to test data). The first two models below omit director; the third includes it. Another option is to add dummies only for the directors who appear most frequently in the data and code the rest as 'other' (see the sketch after the genre table below).
sort(table(movies$genre))
Thriller Musical Romance Western Family Sci-Fi
1 2 2 2 3 7
Mystery Documentary Fantasy Animation Horror Biography
16 25 28 35 131 135
Crime Adventure Drama Action Comedy
202 288 498 738 848
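Here is a hedged sketch of that 'top directors plus other' recoding in base R (the cutoff of 10 appearances and the movies2 copy are arbitrary choices for illustration):

movies2 <- movies    # work on a copy so the models below are unaffected

# keep dummies for directors appearing at least 10 times; lump the rest as "other"
top_directors <- names(which(table(movies2$director) >= 10))
movies2$director <- factor(ifelse(movies2$director %in% top_directors,
                                  as.character(movies2$director), "other"))

# the recoded factor can then enter a formula, e.g. kms(genre ~ . -title, movies2)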
out1 <- kms(genre ~ . -title -director, movies, verbose = 0)
plot(out1$history) +
  labs(title = "Classifying Genre",
       subtitle = "Source data: http://s3.amazonaws.com/dcwoods2717/movies.csv",
       y = "") +
  theme_minimal()
Let's fit a couple more models. Notice that hyperparameters will be repeated as appropriate based on N_layers.
out2 <- kms(genre ~ . -title -director, movies, N_layers = 12, batch_size = 1, verbose = 0)
out3 <- kms(genre ~ rank(director) + ., movies, activation = c("tanh", "tanh", "softmax"), units=17, Nepochs = 3, verbose = 0)
We can have a quick look at their fit like so:
out1$evaluations$acc
[1] 0.3223684
out2$evaluations$acc
[1] 0.3044925
out3$evaluations$acc
[1] 0.2516779
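Accuracy alone doesn't show which genres are being confused. Since kms() returns a confusion matrix, a breakdown by genre might look like the following sketch (assuming the element is named confusion; str(out1) lists the exact field names):

cm <- as.matrix(out1$confusion)   # rows: observed genre; columns: predicted
diag(cm) / rowSums(cm)            # per-genre accuracy, provided the table is square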
The real choice appears to be between Model 1 and Model 2, with perhaps a faint edge to Model 1. batch_size was set to 1 in Model 2 to give the estimator more of a fighting chance for rare outcomes. For a more general introduction that shows how to change the loss, layer type and number, activation, and so on, see the package vignettes or this example using Twitter data.
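As a small taste of that customizability (a hypothetical variant, not one of the models fit above), the loss and optimizer can be set directly in the call:

out4 <- kms(genre ~ . -title -director, movies,
            loss = "categorical_crossentropy",
            optimizer = "adam",
            Nepochs = 10, verbose = 0)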
This example works with some of the imdb data that comes with library(keras). Specifically, it compares the default dense model that kms generates to the lstm model described here. To control runtime, the number of features is limited and only a sliver of the training data is used.
max_features <- 5000   # 5,000 words (ranked by popularity) found in the movie reviews
maxlen <- 50           # if applicable, cuts each review after 50 words
                       # (among the top max_features most common words)
cat('Loading data...\n')
Loading data...
imdb <- dataset_imdb(num_words = max_features)
imdb_df <- as.data.frame(cbind(imdb$train$y, pad_sequences(imdb$train$x, maxlen = maxlen)))
demo_sample <- sample(nrow(imdb_df), 1000)
out_dense <- kms("V1 ~ .", data = imdb_df[demo_sample, ], Nepochs = 2, verbose = 0)
out_dense$evaluations$acc
[1] 0.5195531
k <- keras_model_sequential()
k %>%
  layer_embedding(input_dim = max_features, output_dim = 128) %>%     # word embeddings
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>%  # recurrent layer
  layer_dense(units = 1, activation = 'sigmoid')                      # binary output
k %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)
out_lstm <- kms(input_formula = "V1 ~ .", data = imdb_df[demo_sample, ], keras_model_seq = k, Nepochs = 2, verbose = 0)
out_lstm$evaluations$acc
Though kms contains a number of parameters, the goal is not to replace the vast customizability that keras offers. Rather, like qplot in the ggplot2 library, kms offers convenience for common scenarios. Or, perhaps better, like MCMCpack or rstan do for Bayesian MCMC, kms aims to introduce users familiar with regression in R to neural nets without steep scripting stumbling blocks. Suggestions are more than welcome!