NorskRegnesentral/shapr

Multi-core/memory issues with shapr::explain predicting with ranger models

samkodes opened this issue · 3 comments

I am encountering large memory waves (one wave per call) when calling shapr::explain with approach="gaussian" on a ranger model on Windows, occasionally resulting in out-of-memory errors and crashes. The model has only 200 trees trained on 2,000 data points, and the explanation covers only 100 data points. The dimensionality is 7, yet the waves are on the order of 20 GB. This has made a 10-dimensional version of the same problem unmanageable on my 64 GB, 16-core workstation.

These waves are synchronized with high multi-core CPU use, and are likely caused by ranger's prediction method (which is multi-core and uses all available cores by default). Supporting this theory is the fact that I also get progress messages that appear to originate from ranger while this is happening (of the form: Predicting.. Progress: 89%. Estimated remaining time: 3 seconds.). If I understand the shapr code correctly, there is a single call to the model's prediction method with all resampled covariate data points needed for the Shapley calculation, so this is potentially a rather large data set to predict over. ranger may simply be allocating memory for too many cores at once, presumably assuming the task is CPU-bound rather than memory-bound. (This may also be a Windows-only issue, since Windows has no shared memory between forked processes - I am not sure whether ranger is subject to that, and I have no Linux system to test on at the moment.)

If this is the cause, the problem could perhaps be controlled easily by letting users specify the number of cores used by the predict call in predict_model.ranger. This would require the ability to pass additional options through the call stack from the explain interface down to the prediction function, perhaps via R's three-dots (...) mechanism.
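For concreteness, ranger's predict method does accept a num.threads argument, so the idea would be something like the following sketch. Note that the num.threads pass-through shown here is hypothetical - shapr::explain does not currently forward extra arguments to the prediction function:

```r
library(shapr)
library(ranger)

# Hypothetical sketch only: if shapr forwarded "..." down to
# predict_model.ranger, the thread count could be capped like this.
# (num.threads is a real argument of predict.ranger; the pass-through
# through shapr::explain is the part that does not exist today.)
explanation <- explain(
  model = fitted_ranger_model,
  x = x_to_explain,
  explainer = explainer,
  approach = "gaussian",
  prediction_zero = p0,
  num.threads = 1  # hypothetical pass-through to predict.ranger
)
```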

I was wrong; this is not a multi-core issue (memory consumption remains the same when using a single core). The issue is simply the large number of points needing predictions, combined with some internal inefficiency in ranger (by contrast, xgb.predict on the same data set hardly increases memory use at all). Passing n_samples = 500 to shapr::explain cuts memory consumption roughly in half.

Just commenting on this even though you closed it:

It would certainly be possible to let the user pass additional options to the predict function. However, since you can already supply your own prediction function (see this section of our vignette: https://norskregnesentral.github.io/shapr/articles/understanding_shapr.html#explain-custom-models), you can simply overwrite the one already implemented for ranger. For that reason, I don't think we will add this possibility.
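As an illustration of that route, one could shadow the built-in method with a version that pins ranger to a single thread. This is a sketch under the assumption that S3 dispatch picks up a user-defined predict_model.ranger, as described in the vignette linked above; num.threads is a documented argument of predict.ranger:

```r
library(ranger)

# Sketch: override shapr's built-in prediction method for ranger models
# so that predictions run single-threaded, capping per-call memory.
predict_model.ranger <- function(x, newdata) {
  # predict.ranger returns a list; $predictions holds the numeric vector
  predict(x, data = newdata, num.threads = 1)$predictions
}
```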

Regarding your discovery that it is simply the large number of points requiring prediction that is the bottleneck, please see my ideas for a batch mode here: #244
In theory, that should "solve" the problem.

Yes, I was thinking a batch mode would be a good way to avoid having to reduce the number of samples. Ranger's memory-hogging behaviour is strange, though, especially compared to xgboost. I did not realize we could provide our own prediction functions; that would allow batching, I suppose. Thanks!
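Indeed, a user-supplied prediction function could batch on its own, without waiting for #244. A rough sketch (the 10,000-row chunk size is an arbitrary choice, and this is not shapr's proposed batch mode - just chunked prediction to cap peak memory):

```r
library(ranger)

# Sketch: a custom prediction function that predicts in chunks so that
# ranger never materializes its working memory for the full resampled
# data set at once.
predict_model.ranger <- function(x, newdata) {
  chunk_size <- 10000  # arbitrary; tune to available memory
  chunks <- split(
    seq_len(nrow(newdata)),
    ceiling(seq_len(nrow(newdata)) / chunk_size)
  )
  unlist(
    lapply(chunks, function(i) {
      predict(x, data = newdata[i, , drop = FALSE],
              num.threads = 1)$predictions
    }),
    use.names = FALSE
  )
}
```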