egenn/rtemis

Rulefit memory issue

Closed this issue · 7 comments

I have an HP EliteBook with an Intel Core i7 and 32 GB of RAM, running Windows 10. When trying to run RuleFit on 80,000 cases with 20 variables, I got a message like "unable to allocate a vector of 7 Gb size". Is there a way to work with datasets of this size or greater?

egenn commented

Hi, what hyperparameters are you using and at which step do you get the error?
Which version of R are you running and what is the output of memory.size() and of memory.limit()?
Thanks

Thanks for your quick reply - I'll send you the code and the error message shortly.

To reproduce a similar case in terms of rows and variables, I create below many row-wise stacked copies of the parkinsons dataset used in the package.

parkinsons <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data")
parkinsons$Status <- factor(parkinsons$status, levels = c(1, 0))
parkinsons$status <- NULL
parkinsons$name <- NULL

checkData(parkinsons)
Dataset: parkinsons

[ Summary ]
195 cases with 23 features:

  • 22 continuous features
  • 0 integer features
  • 1 categorical feature, which is not ordered
  • 0 constant features
  • 0 duplicated cases
  • 0 features include 'NA' values

[ Recommendations ]

  • Everything looks good

parkinsons <- rbind(parkinsons, parkinsons, parkinsons, parkinsons, parkinsons, parkinsons, parkinsons, parkinsons)
parkinsons <- rbind(parkinsons, parkinsons, parkinsons, parkinsons, parkinsons, parkinsons, parkinsons, parkinsons)
parkinsons <- rbind(parkinsons, parkinsons, parkinsons, parkinsons, parkinsons, parkinsons, parkinsons, parkinsons)
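For reference, the same stacking can be written as a single expression. This is a hypothetical shortcut, not part of the original code, assuming the intent is 8 x 8 x 8 = 512 row-wise copies of the 195-case data frame:

## Hypothetical equivalent of the three rbind() lines above:
## 512 row-wise copies of the 195-case data frame -> 99,840 rows
parkinsons <- do.call(rbind, replicate(512, parkinsons, simplify = FALSE))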
res <- resample(parkinsons, seed = 2019)
[2021-03-01 12:35:49 resample] Input contains more than one columns; will stratify on last
[ Resampling Parameters ]
n.resamples: 10
resampler: strat.sub
stratify.var: y
train.p: 0.75
strat.n.bins: 4
[2021-03-01 12:35:49 strat.sub] Using max n bins possible = 2

[2021-03-01 12:35:49 resample] Created 10 stratified subsamples

park.train <- parkinsons[res$Subsample_1, ]
park.test <- parkinsons[-res$Subsample_1, ]

park.rf <- s.RULEFEAT(park.train, park.test)
[2021-03-01 12:35:50 s.RULEFEAT] Hello,
[2021-03-01 12:35:50 s.RULEFEAT] Running Gradient Boosting...
[2021-03-01 12:35:50 s.GBM] Hello,

[2021-03-01 12:35:50 dataPrepare] Imbalanced classes: using Inverse Probability Weighting

[ Classification Input Summary ]
Training features: 74880 x 22
Training outcome: 74880 x 1
Testing features: Not available
Testing outcome: Not available
[2021-03-01 12:35:50 s.GBM] Distribution set to bernoulli

[2021-03-01 12:35:50 s.GBM] Running Gradient Boosting Classification with a bernoulli loss function

[ Parameters ]
n.trees: 100
interaction.depth: 5
shrinkage: 0.001
bag.fraction: 0.5
n.minobsinnode: 5
weights: NULL
[2021-03-01 12:35:50 s.GBM] Training GBM on full training set...
[2021-03-01 12:36:03 s.GBM] ### Caught gbm.fit error; retrying... ###

[ GBM Classification Training Summary ]
              Reference
  Estimated        1       0
          1    50250     388
          0     6198   18044

                      Overall
  Sensitivity          0.8902
  Specificity          0.9789
  Balanced Accuracy    0.9346
  PPV                  0.9923
  NPV                  0.7443
  F1                   0.9385
  Accuracy             0.9120
  AUC                  0.9887

Positive Class: 1
[2021-03-01 12:37:33 s.GBM] Calculating relative influence of variables...

[2021-03-01 12:37:33 s.GBM] Run completed in 1.73 minutes (Real: 103.92; User: 53.10; System: 46)
[2021-03-01 12:37:33 s.RULEFEAT] Collecting Gradient Boosting Rules (Trees)...
600 rules (length<=5) were extracted from the first 100 trees.
[2021-03-01 12:37:34 s.RULEFEAT] Extracted 600 rules...
[2021-03-01 12:37:34 s.RULEFEAT] ...and kept 32 unique rules
[2021-03-01 12:37:34 matchCasesByRules] Matching 32 rules to 74880 cases...
[2021-03-01 12:37:34 s.RULEFEAT] Running LASSO on GBM rules...
[2021-03-01 12:37:34] Hello,

[2021-03-01 12:37:34 dataPrepare] Imbalanced classes: using Inverse Probability Weighting

[ Classification Input Summary ]
Training features: 74880 x 32
Training outcome: 74880 x 1
Testing features: Not available
Testing outcome: Not available

[2021-03-01 12:37:35 gridSearchLearn] Running grid search...
[ Resampling Parameters ]
n.resamples: 5
resampler: kfold
stratify.var: y
strat.n.bins: 4
[2021-03-01 12:37:36 kfold] Using max n bins possible = 2

[2021-03-01 12:37:36 resample] Created 5 independent folds
[ Search parameters ]
grid.params:
alpha: 1
fixed.params:
.gs: TRUE
which.cv.lambda: lambda.1se
[2021-03-01 12:37:36 gridSearchLearn] Tuning Elastic Net by exhaustive grid search:
[2021-03-01 12:37:36 gridSearchLearn] 5 resamples; 5 models total; running on 12 cores ()

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=10m 30s
[ Best parameters to maximize Balanced Accuracy ]
best.tune:
lambda: 0.000410211357111332
alpha: 1

[2021-03-01 12:48:06 gridSearchLearn] Run completed in 10.52 minutes (Real: 631.28; User: 506.65; System: 121.86)

[ Parameters ]
alpha: 1
lambda: 0.000410211357111332

[2021-03-01 12:48:06] Training elastic net model...
Error: cannot allocate vector of size 7.8 Gb
In addition: Warning message:
In s.GBM(x = list(MDVP.Fo.Hz. = c(119.992, 122.4, 116.682, 116.014, :
Caught gbm.fit error: retraining last model and continuing

egenn commented

Thanks -
Which version of R are you running and what is the output of memory.size() and of memory.limit()?
You shouldn't be limited to 7.8 Gb.

memory.size()
[1] 24595.35
memory.limit()
[1] 32541

Sorry, I forgot the R version:

R version 4.0.4 (2021-02-15) -- "Lost Library Book"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

egenn commented

Sorry this got left open - you might be able to increase the available memory on Windows with memory.limit() (see the R documentation, utils::memory.size), but it is unlikely to help much, if at all.
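For example, a minimal sketch on Windows with R 4.0.x (illustrative values; memory.limit() takes a size in MB and the limit can only be raised, and both memory.size() and memory.limit() are defunct no-ops from R 4.2 onward):

## Windows-only, R < 4.2: query and raise the memory limit (values in MB)
memory.limit()              # current limit, e.g. 32541
memory.limit(size = 49152)  # request a higher limit; may page to disk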

Running out of memory while training on a large dataset is common regardless of algorithm. In the case above, it was glmnet that ran out of memory trying to allocate that 7.8 Gb vector. Different algorithms have different memory requirements during training, but many will fail when RAM is limited (we use systems with 500 GB+ of RAM for big models).
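As a rough back-of-envelope illustration (not from the original thread): R stores dense numeric data at 8 bytes per element, so the failed 7.8 Gb allocation corresponds to roughly a billion doubles, far larger than the training data itself:

## approximate number of doubles in a 7.8 Gb allocation
7.8 * 1024^3 / 8            # about 1.05e9 elements
## for comparison, the full 99,840 x 22 numeric training matrix
99840 * 22 * 8 / 1024^2     # about 16.8 MB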