Add support to Sparklyr

Question

abrahamdu opened this issue 5 years ago · 4 comments

Hi,

I have a large data set for training, so I put the data in Spark and use sparklyr for model training. How can I use your package together with sparklyr to plot PDPs?

Thanks.

Answer 1 · 2019-09-08T02:40:31.000Z

Hi @abrahamdu, computing PDPs for Spark-based ML models is currently out of scope for this package, though I wouldn’t rule it out in a future release. Nonetheless, it is quite easy to compute PDPs with sparklyr using a simple join operation combined with a single call to a Spark scoring function. Do you have a reproducible example? If not, let me know what kind of model you are fitting (e.g., regression or classification with gradient boosting) and I can throw together a simple example for you to use.
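
As a rough illustration of the join-then-score idea, here is a minimal local sketch in base R with no Spark involved. The lm() fit on the built-in mtcars data is only a stand-in model, and the names fit, grid, rest, crossed, and par_dep are made up for this example.

```r
# Stand-in model (assumption): any model with a predict() method would do
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Grid of values for the variable of interest (here: wt)
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 20))

# Training data with the variable of interest removed
rest <- mtcars[, setdiff(names(mtcars), "wt")]

# Cross join: pair every grid value with every training row
crossed <- merge(grid, rest, by = NULL)

# One scoring call over the whole crossed table
crossed$yhat <- predict(fit, newdata = crossed)

# Average the predictions within each grid value: the partial dependence
par_dep <- aggregate(yhat ~ wt, data = crossed, FUN = mean)

plot(par_dep, type = "l")
```

The same three steps (cross join, score once, group and average) should carry over to Spark tables, where the join and the aggregation run inside Spark rather than in R.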

Answer 2 · 2019-09-08T14:38:54.000Z

I took the code example from your paper and tried it with sparklyr.

library(pdp)
library(randomForest)
library(sparklyr)

data(boston, package = "pdp")

set.seed(101)
boston.rf <- randomForest(cmedv ~ ., data = boston, importance = TRUE)
varImpPlot(boston.rf)

partial(boston.rf, pred.var = "lstat", plot = TRUE)

sc <- spark_connect(master = 'local')
boston_sc <- copy_to(sc, boston, overwrite = TRUE)
boston_rf <- boston_sc %>% as.data.frame()
boston_model <- boston_sc %>% ml_random_forest(cmedv ~ ., type = "auto")
training_result_boston_rf <- ml_predict(boston_model, boston_sc)
partial(boston_model, pred.var = "lstat", train = boston_rf, plot = TRUE, type = "auto")

I'm not sure how to use pdp to draw the plot for the Spark model, though.

Thanks in advance for your help.

Answer 3 · 2019-09-08T16:34:20.000Z

You can definitely use pdp with Spark-based ML models by creating a custom prediction wrapper via the pred.fun argument, though this is not optimal. If you're doing your work in Spark, you should do all the PDP computations in Spark as well. This is extremely simple using sparklyr and dplyr:

# Load required packages
library(dplyr)
library(pdp)
library(sparklyr)

data(boston, package = "pdp")

sc <- spark_connect(master = 'local')
boston_sc <- copy_to(sc, boston, overwrite = TRUE)
rfo <- boston_sc %>% ml_random_forest(cmedv ~ ., type = "auto")

# Define plotting grid 
df1 <- data.frame(lstat = quantile(boston$lstat, probs = 1:19/20)) %>% 
  copy_to(sc, df = .)

# Remove plotting variable from training data
df2 <- boston %>%
  select(-lstat) %>%
  copy_to(sc, df = .)

# Perform a cross join, compute predictions, then aggregate
par_dep <- df1 %>%
  full_join(df2, by = character()) %>%  # cartesian product
  ml_predict(rfo, dataset = .) %>%
  group_by(lstat) %>%  
  summarize(yhat = mean(prediction)) %>%  # average for partial dependence
  select(lstat, yhat) %>%  # select plotting variables
  arrange(lstat) %>%  # for plotting purposes
  collect()

# Plot results
plot(par_dep, type = "l")
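
For completeness, the pred.fun route mentioned at the top of this answer might look roughly like the sketch below. It is untested here and rests on a few assumptions: spark_pred_mean and the table name "pdp_newdata" are hypothetical, the wrapper relies on the global connection sc, and it assumes partial() accepts a sparklyr model object once pred.fun and train are supplied. It copies the modified training data to Spark once per grid value, which is why the all-in-Spark version above is preferable for large data.

```r
library(dplyr)
library(pdp)
library(sparklyr)

data(boston, package = "pdp")

sc <- spark_connect(master = "local")
boston_sc <- copy_to(sc, boston, overwrite = TRUE)
rfo <- boston_sc %>% ml_random_forest(cmedv ~ ., type = "auto")

# Hypothetical wrapper: partial() calls it with the fitted model and a local
# data frame (the training data with lstat fixed at one grid value). Returning
# a single number per call gives a partial dependence curve.
spark_pred_mean <- function(object, newdata) {
  # Ship the local data frame to Spark (assumed table name, overwritten each call)
  newdata_sc <- copy_to(sc, newdata, name = "pdp_newdata", overwrite = TRUE)
  ml_predict(object, newdata_sc) %>%
    summarize(yhat = mean(prediction, na.rm = TRUE)) %>%
    pull(yhat)
}

# train must be a local data frame so pdp can build the grid for lstat
pd <- partial(rfo, pred.var = "lstat", pred.fun = spark_pred_mean,
              train = boston, grid.resolution = 20)

plotPartial(pd)
```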

Answer 4 · 2019-09-08T18:00:56.000Z

Thanks. This is similar to what I did manually.