Add support to Sparklyr
abrahamdu opened this issue · 4 comments
Hi,
I have a large training data set, so I put the data in Spark and use sparklyr for model training. How can I use your package with sparklyr to plot PDPs?
Thanks.
Hi @abrahamdu, computing PDPs for Spark-based ML models is currently out of scope for this package, though I wouldn't rule it out in a future release. Nonetheless, it is quite easy to compute PDPs with sparklyr using a simple join operation combined with a single call to a Spark scoring function. Do you have a reproducible example? If not, let me know what kind of model you are fitting (e.g., regression or classification with gradient boosting) and I can throw together a simple example for you to use.
I took the code example shown in your paper and tried to adapt it to sparklyr:
library(pdp)
library(randomForest)
library(sparklyr)
data(boston, package = "pdp")
set.seed(101)
boston.rf <- randomForest(cmedv ~ ., data = boston, importance = TRUE)
varImpPlot(boston.rf)
partial(boston.rf, pred.var = "lstat", plot = TRUE)
sc <- spark_connect(master = 'local')
boston_sc <- copy_to(sc, boston, overwrite = TRUE)
boston_rf <- boston_sc %>% as.data.frame()
boston_model <- boston_sc %>% ml_random_forest(cmedv ~ ., type = "auto")
training_result_boston_rf <- ml_predict(boston_model, boston_sc)
partial(boston_model, pred.var = "lstat", train = boston_rf, plot = TRUE, type = "auto")
I'm not sure how to use pdp to draw the plot with the Spark model, though.
Thanks in advance for your help.
You can definitely use pdp with Spark-based ML models by creating a custom prediction wrapper via the pred.fun argument, though this is not optimal. If you're doing your work in Spark, you should do all the PDP computations in Spark as well. This is extremely simple using sparklyr and dplyr:
# Load required packages
library(dplyr)
library(pdp)
library(sparklyr)
data(boston, package = "pdp")
sc <- spark_connect(master = 'local')
boston_sc <- copy_to(sc, boston, overwrite = TRUE)
rfo <- boston_sc %>% ml_random_forest(cmedv ~ ., type = "auto")
# Define plotting grid
df1 <- data.frame(lstat = quantile(boston$lstat, probs = 1:19/20)) %>%
  copy_to(sc, df = .)
# Remove plotting variable from training data
df2 <- boston %>%
  select(-lstat) %>%
  copy_to(sc, df = .)
# Perform a cross join, compute predictions, then aggregate
par_dep <- df1 %>%
  full_join(df2, by = character()) %>%    # cartesian product
  ml_predict(rfo, dataset = .) %>%
  group_by(lstat) %>%
  summarize(yhat = mean(prediction)) %>%  # average for partial dependence
  select(lstat, yhat) %>%                 # select plotting variables
  arrange(lstat) %>%                      # for plotting purposes
  collect()
# Plot results
plot(par_dep, type = "l")
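To check the join, predict, and average steps without a Spark session, the same logic can be sketched locally in base R. This is only an illustration under stated assumptions: the built-in mtcars data and an lm() fit stand in for the Spark table and the random forest, and the names fit, grid, others, crossed, and par_dep_local are made up for this sketch.

```r
# Local, Spark-free sketch of the cross join -> predict -> average steps.
# Assumption: mtcars and lm() are stand-ins for the Spark table and model.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Plotting grid for the variable of interest (unique() drops tied quantiles)
grid <- data.frame(
  wt = unique(quantile(mtcars$wt, probs = 1:19 / 20, names = FALSE))
)

# Training data with the plotting variable removed
others <- mtcars[, setdiff(names(mtcars), "wt")]

# Cross join: merge() with by = NULL returns the Cartesian product
crossed <- merge(grid, others, by = NULL)

# Score every (grid value, training row) combination in a single call
crossed$prediction <- predict(fit, newdata = crossed)

# Average the predictions within each grid value (partial dependence)
par_dep_local <- aggregate(prediction ~ wt, data = crossed, FUN = mean)
names(par_dep_local)[2] <- "yhat"
```

The result has one row per grid value and can be plotted the same way as above, e.g. plot(par_dep_local, type = "l"). Because the stand-in model is linear, the curve is a straight line with slope equal to the wt coefficient, which makes it a handy sanity check before running the Spark version.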
Thanks. This is similar to what I did manually.