NorskRegnesentral/shapr

Run shapr on HPC with a large size for x_explain

AbdollahiAz opened this issue · 8 comments

Hi,
I am trying to use shapr in my project, but I have run into a problem. I set these parameters in my code and ran it on an HPC with 100 GB of memory and 4 workers:

explanation <- explain(
  model = model,
  x_explain = x_test,
  x_train = x_train,
  approach = "copula",
  prediction_zero = p0,
  n_combinations = 10000,
  n_batches = 20
)

In my main dataset, I have x_train = 16 x 13920 and x_test = 16 x 3480 (16 features). I decided to only consider a subset of 300 observations for the explanation, so I used an x_explain with 300 observations and 16 features, but I get this error:

Error in unserialize(node$con) :
  MultisessionFuture (future_lapply-1) failed to receive message results from cluster RichSOCKnode #1 (PID 47928 on localhost 'localhost'). The reason reported was 'error reading from connection'. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. Detected a non-exportable reference ('externalptr') in one of the globals ('future.call.arguments' of class 'DotDotDotList') used in the future expression. The total size of the 8 globals exported is 370.46 KiB. The three largest globals are 'future.call.arguments' (98.45 KiB of class 'list'), '...future.FUN' (87.62 KiB of class 'function') and 'compute_preds' (62.97 KiB of class 'function')
Calls: explain ... resolved -> resolved.ClusterFuture -> receiveMessageFromWorker
Execution halted 

Surprisingly, when I use an x_explain with 200 observations and 16 features, it finishes successfully. Could you explain why this problem happens? In fact, my main goal is to explain all 3480 observations. Is that possible? Please note that I have access to an HPC, so the computational cost does not really matter.
Thanks in advance for your help, and for the outstanding shapr package!

Hi

Hmm, how exactly did you set up the parallelization with future? Which backend are you using, and what OS is this on? It seems some data gets lost on its way to some of the workers...

Before doing anything else, I would, however, recommend trying out the latest commit in branch LHBO:Lars/Improve_Gaussian in PR #366.

This should speed up the copula method by orders of magnitude.

We'll probably merge the PR into master tomorrow.

Dear Martin,

Many thanks for the quick feedback. I am new to R and shapr. The HPC runs in a Linux environment. Regarding PR #366, I do not understand the suggestions; would you please help me? Please find my code below:

library(xgboost)
library(shapr)
library(readxl)
library(openxlsx)
#library(ggbeeswarm)
library(future)
future::plan(multisession, workers = 2)

x_var <- c("Day",	"SC",	"Env_Be",	"Env_L",	"Env_mod"	,"Env_Sev"	,"Sp_num",	"h_NoD",	"h_Mi"	,"h_Mo"	,"h_Ex",	"h_Se"	,"Res_Mi",	"Res_mo",	"Res_ex",	"Res_Se")
y_var <- "Re"

x_train1 <- read_excel("Year07.xlsx", sheet = "X_train")
y_train1 <- read_excel("Year07.xlsx", sheet = "Y_train")
x_test1 <- read_excel("Year07.xlsx", sheet = "X_test")
y_test1 <- read_excel("Year07.xlsx", sheet = "Y_test")
x_train <- as.matrix(x_train1)
y_train <- as.matrix(y_train1)
x_test <- as.matrix(x_test1)
y_test <- as.matrix(y_test1)


cor(x_train)

model <- xgboost(
  data = as.matrix(x_train),
  label = y_train,
  nrounds = 20,
  verbose = FALSE
)

p0 <- mean(y_train)

explanation <- explain(
  model = model,
  x_explain = x_test,
  x_train = x_train,
  approach = "copula",
  prediction_zero = p0,
  n_combinations = 10000,
  n_batches = 10
)

print(explanation$shapley_values)
write.xlsx(explanation$shapley_values, "outputResult.xlsx")
pdf("/home/aza.iut/RFolder/Plot.pdf", width = 15, height = 10)
#plot(explanation)
#dev.off()
if (requireNamespace("ggplot2", quietly = TRUE)) {
  plot(explanation, plot_type = "scatter")
  #plot(explanation, plot_type = "beeswarm")
}
dev.off()

I can also send you the dataset. Thank you in advance for your valuable time.

Best,
Az

Hi again. The mentioned PR is now merged, so simply installing shapr again with remotes::install_github("NorskRegnesentral/shapr") will give you the newest version, which has a new, much faster copula method implemented. Please try that first to see if the problem is still there. If not, try running without parallelization, as there might be something up with the data you are explaining, since it works with 200 observations but not with 300.
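For reference, here is a rough sketch of what that could look like with the objects from your script (model, x_test, x_train, p0); the sequential plan is just to rule out the parallel backend:

# Reinstall the development version with the new copula implementation
remotes::install_github("NorskRegnesentral/shapr")

library(shapr)
library(future)
future::plan(sequential)  # run without parallelization while debugging

explanation <- explain(
  model = model,
  x_explain = x_test,
  x_train = x_train,
  approach = "copula",
  prediction_zero = p0,
  n_combinations = 10000,
  n_batches = 10
)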

Hello Martin,

I really appreciate your quick feedback.
May I ask a somewhat naïve question? For cases with more than 200 observations, is it possible to split x_explain into subsets of 200? In other words, suppose I have 400 observations: with two separate runs, one for observations 1-200 and the other for 201-400, can I get Shapley values for the whole dataset?

Best,
Az

Yes, indeed. This is essentially what is done if you set n_batches = 2. Depending on how much preprocessing there is, calling explain() twice may take noticeably longer than a single call, or almost the same time.
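For illustration, a rough sketch of the manual split (the row indices 1:200 and 201:400 match your 400-observation example; stacking the shapley_values tables with rbind is one way to combine the results):

# First half of the observations
expl_1 <- explain(
  model = model,
  x_explain = x_test[1:200, ],
  x_train = x_train,
  approach = "copula",
  prediction_zero = p0,
  n_combinations = 10000,
  n_batches = 1
)

# Second half of the observations
expl_2 <- explain(
  model = model,
  x_explain = x_test[201:400, ],
  x_train = x_train,
  approach = "copula",
  prediction_zero = p0,
  n_combinations = 10000,
  n_batches = 1
)

# Per-observation Shapley values for the whole set
shapley_all <- rbind(expl_1$shapley_values, expl_2$shapley_values)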

Many thanks, Martin.

  1. As mentioned before, I am new to shapr and R, and I cannot understand the meaning of

This is essentially what is done if you set n_batches =2.

According to (https://norskregnesentral.github.io/shapr/articles/understanding_shapr.html)

we advise that n_batches equals some positive integer multiplied by the number of workers.

The memory/computation time trade-off is most apparent for models with more than, say, 6-7 features. Below we show a basic example where n_batches = 10:

  2. Let me re-explain my problem: I have quite a large dataset (about 3000 observations that need to be explained by shapr). It seems the only solution is to split x_explain into 15 subsets. Am I right?

I would be honored to benefit from your guidance. Would you please have a glance at my code and let me know your suggestions? (My HPC has enough computational capacity.)

Kind regards,
Az

If you set n_batches = 15, that essentially corresponds to splitting your x_explain into 15 parts and calling explain() 15 times (with n_batches = 1). Since you have very large data which may take hours or days to explain properly, I would still split your x_explain into 15 parts (and use n_batches = 10 or so). The reason is that if something crashes, you lose everything, as there is no temporary saving to disk implemented as of now. If you do it one part at a time, you can handle that yourself.
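Something along these lines could work (the chunking scheme and the file names are just one way to do it, reusing the objects from your script):

# Split the rows of x_test into 15 chunks and save each result to disk
n_chunks <- 15
chunk_id <- cut(seq_len(nrow(x_test)), breaks = n_chunks, labels = FALSE)

for (i in seq_len(n_chunks)) {
  expl_i <- explain(
    model = model,
    x_explain = x_test[chunk_id == i, , drop = FALSE],
    x_train = x_train,
    approach = "copula",
    prediction_zero = p0,
    n_combinations = 10000,
    n_batches = 10
  )
  saveRDS(expl_i, file = paste0("explanation_part_", i, ".rds"))  # survives a later crash
}

# Read the parts back and stack the Shapley values
parts <- lapply(seq_len(n_chunks), function(i) readRDS(paste0("explanation_part_", i, ".rds")))
shapley_all <- do.call(rbind, lapply(parts, `[[`, "shapley_values"))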

I would also recommend using the progress bar option to follow the progress. See the vignette for how to set that up.
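Roughly, assuming the progressr package that the vignette describes, it could look like this:

library(progressr)
progressr::handlers(global = TRUE)     # report progress for all calls
progressr::handlers("txtprogressbar")  # simple text progress bar

# explain() is then called as usual and progress is shown per batch
explanation <- explain(
  model = model,
  x_explain = x_test,
  x_train = x_train,
  approach = "copula",
  prediction_zero = p0,
  n_batches = 10
)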

Hope this helps.

Dear Martin,

I would like to extend my sincere gratitude to you for this fruitful discussion. I wish you continued outstanding achievements in the development of the shapr package.

Best regards,
Az